# DS3000 Lesson 3

Logistics
- Homework 1 due yesterday
- Homework 2 will be posted on Tuesday
- Make sure you have checked your group assignment
- Modules needed today (`seaborn, pandas`); if they are not installed, run (in terminal):
    - `pip install seaborn pandas`

Content
- numpy & arrays
- pandas
    - series
    - dataframe
- seaborn

# Why do we make such a fuss to represent data as arrays?
It's often convenient to think of a dataset as a big table.  A dataset describes the **features** of a collection of **samples**:
- each row represents a sample
    - e.g. a penguin
- each column represents a feature
    - e.g. how heavy the penguins are
- the intersection of a row and column contains the feature of the sample
    - e.g. how heavy a particular penguin is
    
<img src="https://imgur.com/orZWHly.png" width=300 />


In [1]:
# (we'll cover this code later, for now I just want us all to
# look at a dataset together)

import seaborn as sns

# data source: https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv
df_penguin = sns.load_dataset('penguins')
df_penguin.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


Why represent data in 2d arrays?
- many datasets are well encapsulated as a 2d array with 
    - a different row for each sample
    - a different column for each feature
- Arrays (matrices) are natural math objects in linear algebra, probability and statistics, all of which underpin machine learning.

## Rows vs Columns
<img src="https://learnenglishfunway.com/wp-content/uploads/2021/07/Row-vs-Column.jpg" width=700 />

- row: an entry, an observation, a person, an object
- column: feature/factor/variable/attribute

Two important ways to classify the columns:

By role:
1. response/target: something you are interested in or want to predict
2. feature/explanatory variable: something you use to explain or predict the target

By type:
1. numerical variable
2. character variable

## **NumPy** (**Numerical Python**) Library
* First appeared in 2006 and is the **preferred Python array implementation**.
* High-performance, richly functional **_n_-dimensional array** type called **`ndarray`**. 
* **Written in C** and **up to 100 times faster than lists**.
* Critical in big-data processing, AI applications and much more. 
* According to `libraries.io`, **over 450 Python libraries depend on NumPy**. 
* Many popular data science libraries such as Pandas, SciPy (Scientific Python) and Keras (for deep learning) are built on or depend on NumPy. 

Big Question:
```
What is an array/matrix?  (and how is it different from a list or list of lists?)
```

| Array                                 | List (Python: Dynamic Array)                         |
|---------------------------------------|------------------------------------------------------|
| Size is static (contiguous memory)    | Size can be modified quickly (non-contiguous memory) |
| Quick to compute (esp Linear Algebra) | Slower to compute (and clumsy looking code)          |
| contains 1 datatype (numeric)         | may contain many data types (need not be numeric)    |
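A small sketch of two rows of the table above, using made-up values: the same operator behaves differently on a list and on an array, and an array converts its contents to a single data type.

```python
import numpy as np

py_list = [1, 2, 3]
arr = np.array(py_list)

# "*" repeats a list, but multiplies every element of an array
list_times_two = py_list * 2      # [1, 2, 3, 1, 2, 3]
arr_times_two = arr * 2           # array([2, 4, 6])

# a list may mix types; an array converts everything to one dtype
mixed = np.array([1, 2.5, 3])     # the ints are upcast to float64

print(list_times_two, arr_times_two, mixed.dtype)
```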

### Initializing arrays:
- 1d from list / tuple
- 2d from list / tuple

In [2]:
import numpy as np

# generate a 1D array with length of 3
np.array((1,2,3))

array([1, 2, 3])

In [4]:
np.array(([1,2,3],[4,5,6]))

array([[1, 2, 3],
       [4, 5, 6]])

### Building some special matrices
- zeros
- ones
- full 
- identity


<img src="https://learnenglishfunway.com/wp-content/uploads/2021/07/Row-vs-Column.jpg" width=200 />

#### Convention: Rows First!
- we describe array shape as `(n_rows, n_cols)`
- we index into an array as `x[row_idx, col_idx]`

In [6]:
np.zeros((5,2))

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

In [7]:
np.ones((2,5), dtype = int)

array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])

In [8]:
np.full(shape = (2,5), fill_value = 2.0)

array([[2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.]])

In [11]:
# identity matrix

np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

## Arrays which change: 
- `np.arange()`
- `np.linspace()`
- `np.geomspace()`
- `np.logspace()`

In [13]:
# np.arange(start(inclusive), stop(exclusive), step size )
np.arange(0,10,2)

array([0, 2, 4, 6, 8])

In [14]:
# np.linspace(start(inclusive), stop(inclusive), size)
np.linspace(0,1,7)

array([0.        , 0.16666667, 0.33333333, 0.5       , 0.66666667,
       0.83333333, 1.        ])

In [16]:
# np.geomspace(start(inclusive), stop(inclusive), size)
np.geomspace(1,27,3)

array([ 1.        ,  5.19615242, 27.        ])

In [17]:
# np.logspace(start, stop, size): evenly spaced exponents, from 10^start to 10^stop (both inclusive)
np.logspace(0,2,3)

array([  1.,  10., 100.])

### Array Attributes
- shape
- size
- ndim

Numpy can build arrays out of many different number types (bool, int, float).  ([see also](https://numpy.org/doc/stable/user/basics.types.html#:~:text=There%20are%205%20basic%20numerical,point%20(float)%20and%20complex.&text=NumPy%20knows%20that%20int%20refers,int_%20%2C%20bool%20means%20np.))
- dtype
    - astype
- nbytes

In [26]:
x = np.array(([1,2,3],[4,5,6]))

In [27]:
x.dtype

dtype('int64')

In [20]:
x.ndim

2

In [21]:
x.shape

(2, 3)

In [23]:
x.size

6

In [24]:
x.nbytes

48

In [25]:
x_low = np.array(([1,2,3],[4,5,6]), np.uint8)
x_low.nbytes

6
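`astype` is listed above but not demonstrated. A minimal sketch, reusing the same small array, of converting to another dtype and seeing the effect on memory:

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])

# astype returns a new array; x itself is unchanged
x_float = x.astype(float)
x_small = x.astype(np.uint8)

print(x_float.dtype)     # float64
print(x_small.itemsize)  # 1 byte per element
print(x_small.nbytes)    # 6 bytes in total
```

Small integer types save memory, but values outside the type's range (0 to 255 for `uint8`) cannot be stored correctly.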

## Manipulating array shape

### Diagonal

The diagonal of each array is shaded below; the unshaded elements are not on the diagonal of the matrix:

$$ \begin{bmatrix}
\blacksquare & \square & \square\\
\square & \blacksquare & \square\\
\square & \square & \blacksquare\\
\square & \square & \square\\
\end{bmatrix} 
\hspace{2cm}
\begin{bmatrix}
\blacksquare & \square & \square & \square & \square\\
\square & \blacksquare & \square& \square & \square\\
\square & \square & \blacksquare& \square & \square\end{bmatrix}
\hspace{2cm}
\begin{bmatrix}
\blacksquare & \square & \square\\
\square & \blacksquare & \square\\
\square & \square & \blacksquare
\end{bmatrix} 
$$
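As a quick check of the picture above, here is a sketch that pulls the diagonal out of a tall and a wide array with `.diagonal()`. The arrays are made up for illustration and have the same shapes as the first two matrices.

```python
import numpy as np

tall = np.arange(12).reshape((4, 3))
wide = np.arange(15).reshape((3, 5))

# the diagonal holds the entries whose row index equals their column index
diag_tall = tall.diagonal()
diag_wide = wide.diagonal()

print(diag_tall)   # [0 4 8]
print(diag_wide)   # [ 0  6 12]
```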

### Numpy methods
- transpose
- `.reshape()`
    - order of reshape (row or column first?)

In [28]:
x = np.array([[1, 2, 3],
              [4, 5, 6]]) 
x

array([[1, 2, 3],
       [4, 5, 6]])

In [29]:
# transpose
y = x.T
y

array([[1, 4],
       [2, 5],
       [3, 6]])

In [32]:
x.reshape((1,6))

array([[1, 2, 3, 4, 5, 6]])

In [None]:
# reshape must have the same number of total elements

In [34]:
z = np.arange(0,12)
z

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [35]:
z.reshape((3,4))

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [36]:
z.reshape((3,-1))

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
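The "row or column first?" question from the list above can be sketched with the `order` argument of `.reshape()`. The default (`order='C'`) fills one row at a time, while `order='F'` fills one column at a time.

```python
import numpy as np

z = np.arange(6)

row_first = z.reshape((2, 3))              # default: fill each row in turn
col_first = z.reshape((2, 3), order='F')   # fill each column in turn

print(row_first)   # [[0 1 2]
                   #  [3 4 5]]
print(col_first)   # [[0 2 4]
                   #  [1 3 5]]
```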

## Array Indexing (slicing)

You can index arrays: everything we've previously shown about `start:stop:step` indexing works for arrays too!

In [37]:
x = np.arange(11)
x

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [38]:
x[2:6]

array([2, 3, 4, 5])

In [41]:
x[5:]

array([ 5,  6,  7,  8,  9, 10])

In [42]:
x[:5]

array([0, 1, 2, 3, 4])

A two dimensional array requires two indices to get a value: `x[row_idx, col_idx]`

(Just like our convention for rows first in shape, the row index comes first as we index into the array)

In [44]:
x = np.arange(20).reshape((4, 5))
x

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [45]:
# row index = 1, second row
# column index = 2, third column
x[1,2]

np.int64(7)

In [48]:
# rows 0 and 1 
# column 2
x[0:2,2]

array([2, 7])

In [50]:
x[0:2, 0:3]

array([[0, 1, 2],
       [5, 6, 7]])

## Super useful slice syntax on arrays:
(so useful it deserves its own title)

In [39]:
# we can use this to get entire rows or columns as needed
x = np.arange(20).reshape((4, 5))
x

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [51]:
x[:,0]

array([ 0,  5, 10, 15])

In [52]:
x[1,:]

array([5, 6, 7, 8, 9])

In [54]:
x[:,-2:]

array([[ 3,  4],
       [ 8,  9],
       [13, 14],
       [18, 19]])

### Computing stats on an array
- `.sum()`
- `.min()`
- `.max()`
- `.mean()`
- `.std()`
    - standard deviation
- `.var()`
    - variance
- `.argmin()`
    - index of item which is smallest
- `.argmax()`
    - index of item which is largest

In [7]:
y = np.arange(100, 112).reshape((3, 4))
y

array([[100, 101, 102, 103],
       [104, 105, 106, 107],
       [108, 109, 110, 111]])
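A sketch of a few of the methods listed above, applied to `y`. The `axis` argument is not mentioned in the list; it is added here to show how the same statistic can be computed per column (`axis=0`) or per row (`axis=1`).

```python
import numpy as np

y = np.arange(100, 112).reshape((3, 4))

total = y.sum()              # 1266, over all 12 entries
overall_mean = y.mean()      # 105.5
idx_largest = y.argmax()     # 11, position in the flattened array
col_min = y.min(axis=0)      # smallest value in each column
row_mean = y.mean(axis=1)    # mean of each row

print(total, overall_mean, idx_largest)
print(col_min)    # [100 101 102 103]
print(row_mean)   # [101.5 105.5 109.5]
```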

## Why are we doing this again?

<img src="https://imgur.com/orZWHly.png" width=300 />

In [10]:
import seaborn as sns

# data source: https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv
df_penguin = sns.load_dataset('penguins')
df_penguin.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


## Array Operations: 
- array and a scalar: 
    - apply operation to every element of array
- array and array: 
    - apply operation to corresponding elements of arrays (requires matching shapes, or shapes compatible under [broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html))


In [18]:
y1 = np.arange(12).reshape((3, 4))
print(y)
print(y1)

[[100 101 102 103]
 [104 105 106 107]
 [108 109 110 111]]
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
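The two cases described above can be sketched with `y` and `y1`. The array `row` is a made-up example of the broadcasting case, where a 1d array of length 4 is applied to every row of a `(3, 4)` array.

```python
import numpy as np

y = np.arange(100, 112).reshape((3, 4))
y1 = np.arange(12).reshape((3, 4))

plus_one = y + 1        # array and scalar: 1 is added to every element
diff = y - y1           # array and array (same shape): element by element

row = np.array([10, 20, 30, 40])
scaled = y1 * row       # broadcasting: row multiplies each row of y1

print(plus_one)
print(diff)             # every entry is 100
print(scaled)
```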


# Pandas

Pandas is a Python module for storing and manipulating tabular data.  

### If we already have `np.array()`, why do we need pandas?
- pandas supports non numeric data (strings for categorical data, for example)
- pandas supports reading / storing data from more formats
    - csv (spreadsheets)
- pandas more elegantly deals with missing data
- pandas handles indexing woes

You could do almost everything pandas does with numpy arrays ... but it would be much more difficult to accomplish.

### Pandas has two essential objects:
- **dataframe**
    - 2 dimensional data structure
    - you've already seen one today!  (we replicate below)
- **series (vectors)**
    - 1 dimensional data structure, each item associated with some index
    - you could store the weight of all the penguins as a series 
        - (all samples of one feature)
    - you could store the weight, bill size, sex, island, etc for a single penguin as a series
        - (all features for one sample)

In [61]:
import seaborn as sns

# df stands for dataframe.  df_penguin is a dataframe of penguin data
df_penguin = sns.load_dataset('penguins')
df_penguin.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


## Pandas Series

Pandas series contain a sequence of labelled data elements:
- penguin0's `species` is `Adelie`
- penguin0's `island` is `Torgersen`
- penguin0's `bill_length_mm` is `39.1` ...
- penguin0's `<index-name>` is `<corresponding-value>`

A series is quite similar to a dictionary ...

In [59]:
penguin0_dict = {'species': 'Adelie',
 'island': 'Torgersen',
 'bill_length_mm': 39.1,
 'bill_depth_mm': 18.7,
 'flipper_length_mm': 181.0,
 'body_mass_g': 3750.0,
 'sex': 'Male'}

In [60]:
import pandas as pd

# build a series from dict
penguin0_series = pd.Series(penguin0_dict)
penguin0_series

species                 Adelie
island               Torgersen
bill_length_mm            39.1
bill_depth_mm             18.7
flipper_length_mm        181.0
body_mass_g             3750.0
sex                       Male
dtype: object

In [62]:
index = ['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex']
values = ['Adelie', 'Torgersen', 39.1, 18.7, 181.0, 3750.0, 'Male']
penguin0_series = pd.Series(values, index=index)
penguin0_series

species                 Adelie
island               Torgersen
bill_length_mm            39.1
bill_depth_mm             18.7
flipper_length_mm        181.0
body_mass_g             3750.0
sex                       Male
dtype: object

In [63]:
# no index given: pandas labels the items 0, 1, 2, ...
ice_cream = ('vanilla', 'chocolate', 'cherry', 'coffee')
pd.Series(ice_cream)

0      vanilla
1    chocolate
2       cherry
3       coffee
dtype: object

### accessing / changing data
- accessing custom index:
    - by name: `series.loc[name]`
    - by position: `series.iloc[idx]`
- iterating: keys, items (much like dict; `iteritems` was removed in pandas 2.0)
- deleting an entry

In [26]:
dict_fav_num = {'Eric':  17, 'Xiaoyi': 8, 'Lynn': 3, 'Tamrat': 1}
series_fav_num = pd.Series(dict_fav_num)
series_fav_num

Eric      17
Xiaoyi     8
Lynn       3
Tamrat     1
dtype: int64
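A minimal sketch of the access patterns listed above, using the same favorite-number series. Using `.drop()` for deleting an entry is one option among several and is an assumption about what the list intends.

```python
import pandas as pd

series_fav_num = pd.Series({'Eric': 17, 'Xiaoyi': 8, 'Lynn': 3, 'Tamrat': 1})

print(series_fav_num.loc['Lynn'])   # by name -> 3
print(series_fav_num.iloc[0])       # by position -> 17

# change an existing entry
series_fav_num.loc['Lynn'] = 4

# iterate over (index, value) pairs, much like dict.items()
for name, fav_num in series_fav_num.items():
    print(name, fav_num)

# drop returns a new series; the original keeps all four entries
series_shorter = series_fav_num.drop('Tamrat')
print(series_shorter)
```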

### Describing a `pd.Series`

Just like numpy arrays:
- `Series.argmin()`
    - which index has smallest value
    - pandas gives the row number, not the index
- `Series.argmax()`
    - which index has largest value
    - pandas gives the row number, not the index
- `Series.mean()`
- `Series.min()`
- `Series.max()`
- `Series.std()`
- `Series.var()`

New to pandas:
- `Series.count()`
    - number of non-missing values in the series
- `Series.describe()`
    - summary statistics
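A short sketch of these methods on the favorite-number series. `idxmax()` is not in the list above; it is included only to contrast the position returned by `argmax()` with the index label.

```python
import pandas as pd

series_fav_num = pd.Series({'Eric': 17, 'Xiaoyi': 8, 'Lynn': 3, 'Tamrat': 1})

pos_largest = series_fav_num.argmax()     # 0: row number of the largest value
label_largest = series_fav_num.idxmax()   # 'Eric': index label of the largest value
mean_fav = series_fav_num.mean()          # 7.25
n_values = series_fav_num.count()         # 4

print(pos_largest, label_largest, mean_fav, n_values)
print(series_fav_num.describe())
```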

## Pandas: DataFrame

Remember:
- `Series`:  1d data object
- `DataFrame`: 2d data object

`DataFrame`s represent two-dimensional data, for example, grades:

|           | Quiz 0 | Quiz 1 | Quiz 2 |
|-----------|--------|--------|--------|
| Student 0 | 80     | 90     | 50     |
| Student 1 | 87     | 92     | 80     |

Each column or row above could be considered a `Series` object (as we'll see later, we can indeed extract a single row or column of a dataframe as a `Series` object).

In [85]:
import pandas as pd
import numpy as np

quiz_array = np.array([[80, 90, 50],
                 [87, 92, 80]])

df_quiz = pd.DataFrame(quiz_array, 
                       columns=('quiz0', 'quiz1', 'quiz2'), 
                       index=('student0', 'student1'))
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80


In [86]:
# we construct a dataframe from a dictionary
# keys of the dictionary are columns of dataframe
# values are lists (or tuples) of the values in each column
quiz_dict = {'quiz0': [80, 87],
            'quiz1': [90, 92],
            'quiz2': [50, 80]}
pd.DataFrame(quiz_dict, index=('student0', 'student1'))

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80


In [87]:
# another way to construct, this time the transpose
quiz_dict2 = {'student0': [80, 90, 50],
             'student1': [87, 92, 80]}
pd.DataFrame(quiz_dict2, index=('quiz0', 'quiz1', 'quiz2'))

Unnamed: 0,student0,student1
quiz0,80,87
quiz1,90,92
quiz2,50,80


### Describing a `pd.DataFrame`

Similar to Series (for now), but with a couple differences:
- `DataFrame.iloc[].argmin()` or `DataFrame.loc[].argmin()` 
    - note that this does not work on the DataFrame itself, but on specified series/rows
    - which index has smallest value
    - pandas gives the row number, not the index
- `DataFrame.iloc[].argmax()` or `DataFrame.loc[].argmax()`
    - note that this does not work on the DataFrame itself, but on specified series/rows
    - which index has largest value
    - pandas gives the row number, not the index
- `DataFrame.mean()`
- `DataFrame.min()`
- `DataFrame.max()`
- `DataFrame.std()`
- `DataFrame.var()`

New to pandas:
- `DataFrame.count()`
    - number of non-missing values in each column
- `DataFrame.describe()`
    - summary statistics
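A sketch of these methods on the two-student quiz dataframe. The `axis=1` call is an addition to the list above, shown to contrast per-column and per-row statistics.

```python
import pandas as pd

df_quiz = pd.DataFrame({'quiz0': [80, 87],
                        'quiz1': [90, 92],
                        'quiz2': [50, 80]},
                       index=('student0', 'student1'))

quiz_mean = df_quiz.mean()            # one mean per column (per quiz)
student_mean = df_quiz.mean(axis=1)   # one mean per row (per student)

# argmin / argmax work on a single row or column (a Series)
worst_quiz_pos = df_quiz.loc['student0'].argmin()    # 2: quiz2 was the lowest
top_student_pos = df_quiz.loc[:, 'quiz0'].argmax()   # 1: student1 scored highest

print(quiz_mean)
print(student_mean)
print(df_quiz.describe())
```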

## Indexing / Accessing a DataFrame
- indexing: 
    - `.loc[]` indexing by name of row or column
    - `.iloc[]` indexing by position integer (0, 1, 2, 3, 4 ...)
    - both support slicing & subsets
- using `:` to get full rows or columns
- single cell's contents: `.at[]`, `.iat[]` (one cell only; use `.loc[]` / `.iloc[]` for slices)
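The indexing options above can be sketched on the two-student quiz dataframe as follows.

```python
import pandas as pd

df_quiz = pd.DataFrame({'quiz0': [80, 87],
                        'quiz1': [90, 92],
                        'quiz2': [50, 80]},
                       index=('student0', 'student1'))

by_name = df_quiz.loc['student1', 'quiz2']    # 80
by_position = df_quiz.iloc[1, 2]              # 80, the same cell

quiz0_col = df_quiz.loc[:, 'quiz0']           # whole column, as a Series
student0_row = df_quiz.iloc[0, :]             # whole row, as a Series

# label slices with .loc include the stop label
first_two = df_quiz.loc['student0', 'quiz0':'quiz1']

cell_by_name = df_quiz.at['student0', 'quiz1']   # 90
cell_by_position = df_quiz.iat[0, 1]             # 90
```

Note that a `.loc[]` slice includes its stop label, unlike `.iloc[]` and ordinary Python slices, which exclude the stop position.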

## Modifying a DataFrame
- updating values: single cell
- adding a new column: `pd.concat()`

In [None]:
df_quiz

In [105]:
# adding a column (next 2 cells): a more error-robust way of handling indexing
# by explicitly labelling the index, values are matched to rows by name rather than by position
s_overgrade = pd.Series({'student1': 'b-',
                         'student0': 'a+',
                        'student2': 'f (no quizzes taken)'})
s_overgrade

student1                      b-
student0                      a+
student2    f (no quizzes taken)
dtype: object

In [108]:
# rebuild df_quiz
quiz_dict = {'quiz0': [80, 87],
            'quiz1': [90, 92],
            'quiz2': [50, 80]}
df_quiz = pd.DataFrame(quiz_dict, index=('student0', 'student1'))
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80


In [109]:
# notice: name of series becomes the row label (index) once added to a dataframe
# notice: order of items in series doesn't matter, they're aligned by index
s_student3 = pd.Series({'quiz1': 90,
                        'quiz2': 100,
                        'quiz0': 95},
                      name='student3')
s_student3

quiz1     90
quiz2    100
quiz0     95
Name: student3, dtype: int64
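One way the pieces built above might be combined is sketched below, assuming `pd.concat` is the intended tool. The column name `overall` and the result names `df_more_students` and `df_graded` are hypothetical.

```python
import pandas as pd

df_quiz = pd.DataFrame({'quiz0': [80, 87],
                        'quiz1': [90, 92],
                        'quiz2': [50, 80]},
                       index=('student0', 'student1'))
s_overgrade = pd.Series({'student1': 'b-',
                         'student0': 'a+',
                         'student2': 'f (no quizzes taken)'})
s_student3 = pd.Series({'quiz1': 90, 'quiz2': 100, 'quiz0': 95},
                       name='student3')

# updating a single cell
df_quiz.loc['student0', 'quiz2'] = 55

# adding a row: turn the series into a one-row dataframe, then stack it
# underneath; columns are matched by name, not by order
df_more_students = pd.concat([df_quiz, s_student3.to_frame().T])

# adding a column: axis=1 places the series beside the dataframe, matched by
# index label; student2 has no quiz scores, so those cells become NaN
df_graded = pd.concat([df_quiz, s_overgrade.rename('overall')], axis=1)

print(df_more_students)
print(df_graded)
```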

### Operating on DataFrame & Series Objects

Your operators do what you'd expect them to:
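A minimal sketch with the two-student quiz dataframe. The series `s_a` and `s_b` are made up to show that two series are matched by index label rather than by position.

```python
import pandas as pd

df_quiz = pd.DataFrame({'quiz0': [80, 87],
                        'quiz1': [90, 92],
                        'quiz2': [50, 80]},
                       index=('student0', 'student1'))

df_bonus = df_quiz + 5                              # scalar: every grade gets 5 points
quiz_total = df_quiz['quiz0'] + df_quiz['quiz1']    # series + series, element by element
passed = df_quiz.mean(axis=1) >= 80                 # comparison gives a boolean series

# items are paired up by label, even when the order differs
s_a = pd.Series({'a': 1, 'b': 2})
s_b = pd.Series({'b': 10, 'a': 20})
s_sum = s_a + s_b                                   # a: 21, b: 12

print(df_bonus)
print(quiz_total)
print(passed)
print(s_sum)
```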

### Boolean Indexing into DataFrame

Sometimes we want to grab only the rows or columns which meet a particular condition.

"Get all students whose grade was higher than 85 on quiz 1"

In [114]:
quiz_dict = {'quiz0': [80, 87, 60, 30],
            'quiz1': [90, 92, 60, 23],
            'quiz2': [50, 80, 70, 64]}
df_quiz = pd.DataFrame(quiz_dict, index=('student0', 'student1', 'student2', 'student3'))
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80
student2,60,60,70
student3,30,23,64
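The request above can be sketched in two steps: build a boolean series with a comparison, then pass it to `.loc[]` to keep only the rows marked `True`. The variable names are hypothetical.

```python
import pandas as pd

df_quiz = pd.DataFrame({'quiz0': [80, 87, 60, 30],
                        'quiz1': [90, 92, 60, 23],
                        'quiz2': [50, 80, 70, 64]},
                       index=('student0', 'student1', 'student2', 'student3'))

# True for each student whose quiz1 grade is above 85
bool_high_quiz1 = df_quiz['quiz1'] > 85
df_high_quiz1 = df_quiz.loc[bool_high_quiz1, :]     # student0 and student1

# combine conditions with & (and) or | (or); each condition needs parentheses
bool_both = (df_quiz['quiz1'] > 85) & (df_quiz['quiz2'] > 60)
df_both = df_quiz.loc[bool_both, :]                 # only student1

print(df_high_quiz1)
print(df_both)
```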


# Loading Data into Pandas

Data comes from many places:
- Web Scraping
- Application Program Interface (API)
- SQL
- local file:
    - csv
    - JSON
    - fixed width tables (HTML)
    
### Pandas functions which load data
| Function | Description
| ------ | :------
| **`read_csv`** | Load comma-separated values data from a file or URL (other delimiters too!)
| **`read_excel`** | Read data in xls / xlsx format (Microsoft Excel)
| **`read_fwf`** | Read data in fixed-width column format (i.e., columns aligned by position rather than split by a delimiter)
| **`read_clipboard`** | Version of read_csv that reads data from the clipboard; useful for converting tables from web pages
| **`read_html`** | Read all tables contained in the given HTML document.
| **`read_json`** | Read data from a JSON (JavaScript Object Notation) string representation

## Reading CSV into Pandas
- read_csv
- index_col
- header

In [119]:
# note: file must be in same folder as jupyter notebook
pd.read_csv('cleaner_gtky.csv')

Unnamed: 0,fake_student_id,time_stamp,class,co_op,prog_exp,python_exp,java_exp,r_exp,c_exp,age_months,ideal_start_salary_thousands
0,1380,09-09-22 15:37,Sophomore,No,9,Python,,,,234.0,60.0
1,3926,09-09-22 16:01,Sophomore,No,7,Python,,,,233.0,100.0
2,2394,09-09-22 14:19,Junior,Yes,7,Python,Java,,,252.0,70.0
3,4827,09-09-22 16:07,Junior,No,7,Python,Java,,,243.0,60.0
4,9977,09-09-22 16:06,Sophomore,No,5,Python,,R,,231.0,90.0
...,...,...,...,...,...,...,...,...,...,...,...
92,3775,09-09-22 15:50,Junior,Yes,8,Python,,R,,262.0,100.0
93,1562,09-09-22 16:06,Sophomore,No,7,Java,,,,230.0,73.0
94,9610,09-09-22 14:13,Senior,Yes,10,Python,Java,,,264.0,100.0
95,2120,09-09-22 13:46,Junior,Yes,2,Java,,,,246.0,100.0
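The `index_col` and `header` options listed above can be sketched without a file on disk. Here `io.StringIO` stands in for a csv file, and the text is a hypothetical cut-down version of the table above.

```python
import io
import pandas as pd

# hypothetical stand-in for the contents of a csv file
csv_text = """fake_student_id,class,prog_exp
1380,Sophomore,9
3926,Sophomore,7
2394,Junior,7
"""

# index_col: use one of the file's columns as the row index
df_by_id = pd.read_csv(io.StringIO(csv_text), index_col='fake_student_id')

# header=None: the first line is data, not column names, so we supply names
df_no_header = pd.read_csv(io.StringIO('1380,9\n3926,7\n'),
                           header=None,
                           names=['fake_student_id', 'prog_exp'])

print(df_by_id)
print(df_no_header)
```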


## Saving a DataFrame as a csv
- .to_csv()
- index=False
- header=False
- appending to csv (mode='a', header=False)

In [125]:
# why would you want to not save the header?

The example above seems a bit contrived, but imagine you have a web-scraping job which goes to some financial web page every hour and scrapes it to get some new data (more to come on this later).  You could just add the new data as a few new rows to your existing dataset with the syntax shown above.
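The hourly-scrape scenario might look roughly like the sketch below. The data, the column names and the file name `prices.csv` are all made up, and the file is written to a temporary folder so the sketch leaves nothing behind in the notebook's directory.

```python
import os
import tempfile
import pandas as pd

# made-up data: what we already have, and what the latest scrape returned
df_old = pd.DataFrame({'hour': [0, 1], 'price': [10.0, 10.5]})
df_new = pd.DataFrame({'hour': [2], 'price': [10.2]})

# hypothetical file name inside a temporary folder
path = os.path.join(tempfile.mkdtemp(), 'prices.csv')

# first write: include the header, skip the 0, 1, 2, ... index
df_old.to_csv(path, index=False)

# later writes: append rows only, so the header is not repeated mid-file
df_new.to_csv(path, mode='a', index=False, header=False)

df_all = pd.read_csv(path)
print(df_all)
```

This also answers the question of why you might not save the header: when appending, a second header line would be read back as a row of data.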