# DS3000 Day 1

Sep 12, 2023

Admin
- Qwickly Attendance (PIN on board)
- Style Guide on Canvas
- Modules needed today: `pip install seaborn numpy pandas` (you may not *need* to install all of these, but why not?). The `collections` module ships with Python's standard library, so it does not need to be installed.
- Homework 1 due next Tues, Sep 19 by midnight

Push-Up Tracker
- Section 04: 0
- Section 05: 2
- Section 06: 0

Content
- finish up basic python
- numpy & arrays
- pandas
    - series
    - dataframe

### Question from last class: Can you call a key from a dictionary based on a value?

Yes! There are a couple ways, but perhaps most simply, make use of `list()`, `.keys()`, `.values()`, and `.index()` and you can do it in one line (could also break this up into parts):

In [1]:
dict_a = {'first': 10, 'second': ["a", "list"], 'third': True}

# for those who may need a second to process the below:
# take the keys of dict_a and make them into a list
# take the values of dict_a and make them into a list
# get the index of the value that corresponds to True
# use square brackets to get the key corresponding to that index
list(dict_a.keys())[list(dict_a.values()).index(True)]

'third'

In [2]:
l = [1,2,'pi']
l.index('pi')

2
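If several keys could share the same value, or the value might not be present at all, a comprehension over `.items()` is another option. The sketch below is one possible alternative; the helper name `keys_for_value` is made up for illustration and is not something from class:

```python
dict_a = {'first': 10, 'second': ["a", "list"], 'third': True}

def keys_for_value(d, target):
    """ returns a list of every key in d whose value equals target """
    return [key for key, val in d.items() if val == target]

# one matching key
match_true = keys_for_value(dict_a, True)

# no matching key gives an empty list (where .index() would raise a ValueError)
match_none = keys_for_value(dict_a, 'not here')

print(match_true, match_none)
```

Returning a list keeps the function usable when zero keys or many keys match.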

## Where we left off: Functions

Let's briefly finish off talking about functions, which can be a useful way of helping you clean up your data collection/analysis pipeline.

In [3]:
# defaultdict is a dictionary that supplies a default value (here, an empty list) for missing keys
from collections import defaultdict

# another example function
def create_dict(featurenames, features):
    """ creates a dictionary with keys and values corresponding to the inputs

    Args:
        featurenames (list): a list of strings that will serve as the keys to the dictionary
        features (list): a list of values (any type) that will serve as the values of the dictionary

    Returns:
        feature_dict (defaultdict): the dictionary
    """

    # create an empty dictionary whose missing keys default to empty lists
    # (named feature_dict so we don't shadow the built-in dict type)
    feature_dict = defaultdict(list)
    
    try:
        for i in range(len(featurenames)):
            # for each feature name, create a key-value pair in the dictionary
            feature_dict[str(featurenames[i])] = features[i]

            # if age is a feature, create a logical column for an age 30 cut-off
            if featurenames[i] == "age":
                feature_dict["old"] = [age > 30 for age in features[i]]
                
    except Exception:
        # if something doesn't work above, let us know
        print("Failure to create dictionary; check inputs")

    return feature_dict

In [5]:
test = ["bloodtype", "age", "height"]
vals = [["A", "B", "AB", "AB+"], [28, 31, 16, 88], [180, 194, 136, 178]]

# avoid naming the result `dict`, which would shadow the built-in type
feature_dict = create_dict(test, vals)
print(feature_dict)

defaultdict(<class 'list'>, {'bloodtype': ['A', 'B', 'AB', 'AB+'], 'age': [28, 31, 16, 88], 'old': [False, True, False, True], 'height': [180, 194, 136, 178]})
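When the inputs are known to line up, pairing names with values can also be sketched with `zip()` and a dictionary comprehension. This is only a compact alternative under that assumption; `simple_dict` is an illustrative name:

```python
test = ["bloodtype", "age", "height"]
vals = [["A", "B", "AB", "AB+"], [28, 31, 16, 88], [180, 194, 136, 178]]

# zip pairs the i-th name with the i-th list of values
simple_dict = {name: values for name, values in zip(test, vals)}

# the age cut-off column can still be added afterwards
simple_dict["old"] = [age > 30 for age in simple_dict["age"]]

print(simple_dict)
```

Unlike `create_dict`, this version does no error handling, and `zip()` silently stops at the shorter input.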


# LECTURE BREAK/PRACTICE

The Standard Creepiness Rule states that you shouldn't date anyone younger than $\frac{\text{Your Age}}{2} + 7$:

![dating](https://imgs.xkcd.com/comics/dating_pools.png)

Spend 5-10 minutes writing a function that takes as input a person's age, and outputs the sentence: `"You can date people at least {result} years old."`, where `{result}` is the result of calculating the lower bound of the SCR.

Then, define a variable `neighbor_age = int(input())` and pass it to your function after your neighbor enters their age.

**note 1**: we are using `int(input())` because `input()` always returns a string, and we are going to be passing our function an integer.

**note 2**: the solutions to this are in `Day0_BreakSolution` on Canvas.

In [15]:
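def min_age(age):
    """ computes a lower bound for the age of someone you want to date based on your age
        args: user's age (int)
        returns: formatted string of youngest age user should date
    """
    lower_bound = age/2 + 7
    return f'You can date people at least {lower_bound} years old.'

# ask for the age when the function is called, not when it is defined
# (a default argument like age=int(input()) is only evaluated once, at definition)
neighbor_age = int(input())
min_age(neighbor_age)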

21


'You can date people at least 17.5 years old.'

## `import`-ing 

We can use software from [Python's standard library](https://docs.python.org/3/library/index.html) or anything on [PyPI](https://pypi.org/) (e.g. pandas, numpy, scipy, matplotlib; this is where folks share their Python software) by `import`-ing it into our code. We already imported the math module earlier to be able to use $\pi$ and the collections module to use `defaultdict()`.

(If it's on PyPI, you must install it via pip before use; everything in the standard library comes with Python itself.)

Let's build a random number generator:

In [4]:
# imports the whole library
import random

# you can access these functions as attributes of 'random'
random.choices(['pick me', 'or me', 'maybe me?'])

['maybe me?']

In [5]:
# import just the function you'd like from the library
from random import choices

# importing as immediately above, you may only use the functions you've explicitly imported
choices(['pick me', 'or me', 'maybe me?'])

['pick me']

Veteran programming instinct: wondering (and checking) "Does this function also do this other behavior?"

Go to the [official documentation](https://docs.python.org/3/library/random.html) and skim it to see what else you can do with a function.

In [6]:
# we can add a weights parameter which changes how often one item is selected vs another
# if the weight of one item is twice another's, it will be picked twice as often on average
choices(['pick me', 'or me', 'maybe me?'], weights=[1, 2, 3])

['or me']

In [7]:
# or select multiple items by passing k argument
choices(['pick me', 'or me', 'maybe me?'], k=10, weights=[1, 2, 3])

['maybe me?',
 'or me',
 'maybe me?',
 'maybe me?',
 'maybe me?',
 'maybe me?',
 'maybe me?',
 'maybe me?',
 'or me',
 'maybe me?']
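Random draws differ on every run, which can make results hard to reproduce. One common approach (an optional aside, not covered above) is to seed the generator with `random.seed()`; the seed value 3000 below is arbitrary:

```python
import random

options = ['pick me', 'or me', 'maybe me?']

# seeding makes the sequence of "random" draws repeatable
random.seed(3000)
first_draw = random.choices(options, weights=[1, 2, 3], k=5)

# re-seeding with the same value replays the same draws
random.seed(3000)
second_draw = random.choices(options, weights=[1, 2, 3], k=5)

print(first_draw == second_draw)
```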

# Representing data as arrays (or, matrices, if you prefer, though "array" is more general)
It's often a convenient analogy to consider a dataset as a big table.  A dataset describes the **features** of a collection of **samples**:
- each row represents a sample (or, observation)
    - e.g. a penguin
- each column represents a feature (or, variable)
    - e.g. how heavy the penguins are
- the intersection of a row and column contains the feature of the sample
    - e.g. how heavy a particular penguin is
    
<img src="https://imgur.com/orZWHly.png" width=300 />


In [8]:
# (we'll cover this code later, for now I just want us all to
# look at a dataset together)
import seaborn as sns

# data source: https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv
df_penguin = sns.load_dataset('penguins')
df_penguin.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


Why represent data in 2d arrays?
- many datasets are well encapsulated as a 2d array with 
    - different rows used for samples
    - different columns used for features
- Arrays (matrices) are natural math objects in linear algebra, probability and statistics, all of which underpin machine learning.

## Rows vs Columns
<img src="https://learnenglishfunway.com/wp-content/uploads/2021/07/Row-vs-Column.jpg" width=700 />

## **NumPy** (**Numerical Python**) Library
* First appeared in 2006 and is the **preferred Python array implementation**.
* High-performance, richly functional **_n_-dimensional array** type called **`ndarray`**. 
* **Written in C** and **up to 100 times faster than lists**.
* Critical in big-data processing, AI applications and much more. 
* According to `libraries.io`, **over 450 Python libraries depend on NumPy**. 
* Many popular data science libraries such as Pandas, SciPy (Scientific Python) and Keras (for deep learning) are built on or depend on NumPy. 

Big Question:
```
What is an array/matrix?  (and how is it different from a list or list of lists?)
```

| Array                                  | List (Python: Dynamic Array)                         |
|----------------------------------------|------------------------------------------------------|
| Size is static (contiguous memory)     | Size can be modified quickly (non-contiguous memory) |
| Quick to compute (esp Linear Algebra)  | Slower to compute (and clumsy looking code)          |
| contains 1 data type (usually numeric) | may contain many data types (need not be numeric)    |
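A quick way to see the difference is to apply the same operators to a list and to an array. The sketch below assumes only `numpy`, imported as `np` by the usual convention:

```python
import numpy as np

a_list = [1, 2, 3]
an_array = np.array([1, 2, 3])

# * repeats a list, but multiplies every element of an array
list_times_two = a_list * 2
array_times_two = an_array * 2

# mixing ints and floats: the list keeps both types, the array converts to one dtype
mixed_list = [1, 2.5]
mixed_array = np.array([1, 2.5])

print(list_times_two)      # [1, 2, 3, 1, 2, 3]
print(array_times_two)     # [2 4 6]
print(mixed_array.dtype)   # float64
```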

### Initializing arrays:
- 1d from list / tuple
- 2d from list / tuple

In [9]:
import numpy as np

# x is a 1d array with shape (3,)
x = np.array((1, 2, 3))
x

array([1, 2, 3])

In [10]:
# y is a 2d array (2, 3)
y = np.array([[1, 2, 3],
              [4, 5, 6]])
y

array([[1, 2, 3],
       [4, 5, 6]])

### Building some special matrices
- zeros
- ones
- full 
- identity


<img src="https://learnenglishfunway.com/wp-content/uploads/2021/07/Row-vs-Column.jpg" width=200 />

#### Convention: Rows First!
- we describe array shape as `(n_rows, n_cols)`
- we index into an array as `x[row_idx, col_idx]`

In [11]:
# shape = (n_rows, n_cols)
# shape = (height, width)
# .zeros gives an array of all zeros
z = np.zeros((5, 2)) # tall array
z

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

In [12]:
# .ones gives an array of all ones
one_array = np.ones((2, 5), dtype=int)
one_array

array([[1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1]])

In [13]:
# can use .full to create an array of all fill_value
# np.full(shape=(2,5), fill_value=2)
two_array = np.full(shape=(2, 5), fill_value=2.0)
two_array

array([[2., 2., 2., 2., 2.],
       [2., 2., 2., 2., 2.]])

In [14]:
# identity matrix
# 1's on the diagonal, 0s elsewhere
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

## Arrays which change: 
- `.arange()`
- `.linspace()`
- `.geomspace()`
- `.logspace()`

In [15]:
# note: not "arrange", but rather "(a)rray (range)"
# np.arange(start (inclusive), stop (exclusive), step)
np.arange(0, 10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [16]:
# linearly spaced values np.linspace(start (inclusive), stop (inclusive), num (how many values))
np.linspace(0, 1, 7)

array([0.        , 0.16666667, 0.33333333, 0.5       , 0.66666667,
       0.83333333, 1.        ])

In [17]:
# geometrically spaced values np.geomspace(start (inclusive), stop (inclusive), num (how many values))
np.geomspace(1, 27, 4)

array([ 1.,  3.,  9., 27.])

In [18]:
# log spaced values np.logspace(start_exp, stop_exp, num (how many values))
# start = 10^start_exp, stop = 10^stop_exp
np.logspace(0, 2, 3)

array([  1.,  10., 100.])
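The optional `step` argument of `.arange()` and the relationship between `.logspace()` and `.linspace()` can be checked directly. A small sketch:

```python
import numpy as np

# step of 2: start is included, stop is excluded
evens = np.arange(0, 10, 2)

# logspace raises 10 to linearly spaced exponents
exponents = np.linspace(0, 2, 3)
powers = np.logspace(0, 2, 3)

print(evens)                                  # [0 2 4 6 8]
print(np.allclose(powers, 10 ** exponents))   # True
```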

### Array Attributes
- shape
- size
- ndim

Numpy can build arrays out of many different number types (bool, int, float).  ([see also](https://numpy.org/doc/stable/user/basics.types.html#:~:text=There%20are%205%20basic%20numerical,point%20(float)%20and%20complex.&text=NumPy%20knows%20that%20int%20refers,int_%20%2C%20bool%20means%20np.))
- dtype
    - astype
- nbytes

In [19]:
x = np.array([[1, 2, 3],
              [4, 5, 6]]) 

In [20]:
# whether you see int32 or int64 depends on your platform and NumPy version; for our purposes it won't matter
x.dtype

dtype('int32')

In [21]:
# ndim is the (n)umber of (dim)ensions
x.ndim

2

In [22]:
# shape gives the values of the dimensions
x.shape

(2, 3)

In [23]:
# size is total number of elements
x.size

6

In [24]:
x.nbytes

24

In [25]:
# converting the type of an array can help lower the memory demands
# https://numpy.org/doc/stable/user/basics.types.html
x_low_mem = np.array([[1, 2, 3],
                      [4, 5, 6]], np.uint8)
x_low_mem.nbytes

6
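`.astype()` from the list above converts an existing array to a new type; it returns a new array rather than changing the original. A sketch, where the explicit `dtype=np.int64` is there only so the numbers don't depend on your platform:

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]], dtype=np.int64)

# 6 elements * 8 bytes each
print(x.nbytes)          # 48

# convert to 1-byte unsigned integers (only safe when all values are in 0..255)
x_small = x.astype(np.uint8)
print(x_small.nbytes)    # 6

# convert to floats
x_float = x.astype(float)
print(x_float.dtype)     # float64
```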

## Manipulating array shape

### Diagonal

The diagonal of each array is shaded below; the unshaded elements are not on the diagonal of the matrix:

$$ \begin{bmatrix}
\blacksquare & \square & \square\\
\square & \blacksquare & \square\\
\square & \square & \blacksquare\\
\square & \square & \square\\
\end{bmatrix} 
\hspace{2cm}
\begin{bmatrix}
\blacksquare & \square & \square & \square & \square\\
\square & \blacksquare & \square& \square & \square\\
\square & \square & \blacksquare& \square & \square\end{bmatrix}
\hspace{2cm}
\begin{bmatrix}
\blacksquare & \square & \square\\
\square & \blacksquare & \square\\
\square & \square & \blacksquare
\end{bmatrix} 
$$

### Numpy methods
- transpose
- `.reshape()`
    - order of reshape (row or column first?)

In [26]:
x = np.array([[1, 2, 3],
              [4, 5, 6]]) 
x

array([[1, 2, 3],
       [4, 5, 6]])

In [27]:
# transpose (.T): flip across the diagonal
y = x.T
y

array([[1, 4],
       [2, 5],
       [3, 6]])

In [28]:
# reshape allows us to change shape of matrix by defining the dimensions
x.reshape((1, 6))

array([[1, 2, 3, 4, 5, 6]])

In [29]:
# (new matrix must have same total number of elements)
# x.reshape((1, 8))

In [30]:
z = np.arange(0, 12)
z

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [31]:
z.reshape((3, 4))

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [32]:
# -1 may be used at most once in the shape argument; it tells numpy to
# choose that dimension so the output array has the same number of elements
z.reshape((3, -1))

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [33]:
# be mindful that -1 only works if the other dimensions divide the number of elements evenly
# (the 12 elements of z cannot fill 5 rows, so the line below raises an error)
# z.reshape((5, -1))

In [34]:
# we can fill the array across the rows first (order='C'), which is the default
z.reshape((3, 4), order='C')

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [35]:
# or down columns first (order='F')
z.reshape((3, 4), order='F')

array([[ 0,  3,  6,  9],
       [ 1,  4,  7, 10],
       [ 2,  5,  8, 11]])
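Note that reshaping to the transposed shape is not the same as transposing: `.reshape()` refills the elements in reading order, while `.T` flips across the diagonal. A short check:

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])

# both results have shape (3, 2) ...
reshaped = x.reshape((3, 2))
transposed = x.T

# ... but the elements land in different places
print(reshaped.tolist())     # [[1, 2], [3, 4], [5, 6]]
print(transposed.tolist())   # [[1, 4], [2, 5], [3, 6]]
```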

## Array Indexing (slicing)

You can index arrays: everything we've previously shown about `start:stop:step` indexing works for arrays too!

In [36]:
x = np.arange(11)
x

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [37]:
x[5]

5

In [38]:
x[2:6]

array([2, 3, 4, 5])

In [39]:
x[-3:]

array([ 8,  9, 10])

In [40]:
x[:5]

array([0, 1, 2, 3, 4])

A two dimensional array requires two indices to get a value: `x[row_idx, col_idx]`

(Just like our convention for rows first in shape, the row index comes first as we index into the array)

In [18]:
import numpy as np
x = np.arange(20).reshape((4, 5))
x

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [22]:
x[:1]

array([[0, 1, 2, 3, 4]])

In [42]:
# row_idx=1 (second row since python starts counting at 0)
# col_idx=2 (third column since python starts counting at 0)
x[1, 2]

7

In [43]:
# we can start:stop:step slice either index

# get a slice of rows and a constant column
x[0:2, 2]

array([2, 7])

In [44]:
# get a slice of columns and a constant row
x[2, 0:2]

array([10, 11])

## Super useful slice syntax on arrays:
(so useful it deserves its own title)

In [45]:
# by default, the slice indexing chooses start:stop to give the entire object
x = np.array([1, 2, 3])
x[:]

array([1, 2, 3])

In [46]:
# we can use this to get entire rows or columns as needed
x = np.arange(20).reshape((4, 5))
x

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

In [47]:
# get the first column
x[:, 0]

array([ 0,  5, 10, 15])

In [48]:
# get the second row
x[1, :]

array([5, 6, 7, 8, 9])

In [49]:
# get the last two columns
x[:, -2:]

array([[ 3,  4],
       [ 8,  9],
       [13, 14],
       [18, 19]])
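The `step` part of `start:stop:step` works along each axis too. A sketch on the same 4x5 array:

```python
import numpy as np

x = np.arange(20).reshape((4, 5))

# every other row (rows 0 and 2), all columns
every_other_row = x[::2, :]

# every other row, and every other column starting at column 1 (columns 1 and 3)
sub_grid = x[::2, 1::2]

print(every_other_row.tolist())   # [[0, 1, 2, 3, 4], [10, 11, 12, 13, 14]]
print(sub_grid.tolist())          # [[1, 3], [11, 13]]
```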

### Computing stats on an array
- `.sum()`
- `.min()`
- `.max()`
- `.mean()`
- `.std()`
    - standard deviation
- `.var()`
    - variance
- `.argmin()`
    - index of item which is smallest
- `.argmax()`
    - index of item which is largest

In [50]:
x = np.array([4, 3, 5, 4])
x

array([4, 3, 5, 4])

In [51]:
# get index of smallest item
# (smallest item, 3, is at index 1)
x.argmin()

1

In [52]:
y = np.arange(100, 112).reshape((3, 4))
y

array([[100, 101, 102, 103],
       [104, 105, 106, 107],
       [108, 109, 110, 111]])

In [53]:
y.sum(), y.min(), y.max(), y.mean(), y.std(), y.var()

(1266, 100, 111, 105.5, 3.452052529534663, 11.916666666666666)

In [54]:
# axis: which of the shape parameters should I operate on? 
# shape = (shape0, shape1)

# axis=0 averages across different rows to give the column average
y.mean(axis=0)

array([104., 105., 106., 107.])

In [55]:
# axis=1 averages across different columns to give the row average
y.mean(axis=1)

array([101.5, 105.5, 109.5])

In [56]:
# axis is an accepted keyword of all methods listed above
y.min(axis=1)

array([100, 104, 108])

In [57]:
y.min(axis=0)

array([100, 101, 102, 103])
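On a 2d array, `.argmin()` and `.argmax()` without `axis` return a position in the flattened array; with `axis` they return one position per column or row. A sketch with a small made-up array (`np.unravel_index` converts a flat position back to `(row_idx, col_idx)`):

```python
import numpy as np

y = np.array([[104, 101, 110],
              [100, 111, 102]])

# position of the largest item, counting across rows first
flat_idx = y.argmax()
print(flat_idx)                      # 4

# convert the flat position to (row_idx, col_idx)
row_idx, col_idx = np.unravel_index(flat_idx, y.shape)
print(row_idx, col_idx)              # 1 1

# axis=1: position of the largest item within each row
print(y.argmax(axis=1).tolist())     # [2, 1]
```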

## Why are we doing this again?

<img src="https://imgur.com/orZWHly.png" width=300 />

In [58]:
import seaborn as sns

# data source: https://github.com/mwaskom/seaborn-data/blob/master/penguins.csv
df_penguin = sns.load_dataset('penguins')
df_penguin.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


## Array Operations: 
- array and a scalar: 
    - apply operation to every element of array
- array and array: 
    - apply operation to corresponding elements of arrays (requires matching shapes, or shapes that are compatible under [broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html))


In [24]:
y1 = np.arange(12).reshape((3, 4))
y1

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [60]:
y1 + 3

array([[ 3,  4,  5,  6],
       [ 7,  8,  9, 10],
       [11, 12, 13, 14]])

In [61]:
y1 * 10

array([[  0,  10,  20,  30],
       [ 40,  50,  60,  70],
       [ 80,  90, 100, 110]])

In [62]:
y1 ** 2

array([[  0,   1,   4,   9],
       [ 16,  25,  36,  49],
       [ 64,  81, 100, 121]])

In [63]:
# array and array arithmetic
y2 = np.arange(100, 112).reshape((3, 4))
y2

array([[100, 101, 102, 103],
       [104, 105, 106, 107],
       [108, 109, 110, 111]])

In [64]:
# array and array arithmetic applies operation to corresponding items in arrays
y1 + y2

array([[100, 102, 104, 106],
       [108, 110, 112, 114],
       [116, 118, 120, 122]])

In [65]:
y1 * y2

array([[   0,  101,  204,  309],
       [ 416,  525,  636,  749],
       [ 864,  981, 1100, 1221]])

In [66]:
y1

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [67]:
# (++) adding a constant row (x) to all rows of a matrix (y1)
# more details here: https://numpy.org/doc/stable/user/basics.broadcasting.html
x = np.array([1000, 2000, 3000, 4000])
y1 + x

array([[1000, 2001, 3002, 4003],
       [1004, 2005, 3006, 4007],
       [1008, 2009, 3010, 4011]])
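Adding a constant column works the same way, provided the column has shape `(n_rows, 1)`. A sketch under the broadcasting rules linked above:

```python
import numpy as np

y1 = np.arange(12).reshape((3, 4))

# a column with one value per row of y1, shape (3, 1)
col = np.array([10, 20, 30]).reshape((3, 1))

# each row of y1 gets its own constant added
result = y1 + col
print(result.tolist())
# [[10, 11, 12, 13], [24, 25, 26, 27], [38, 39, 40, 41]]
```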

# 5-minute Break Time! Ask me questions about NumPy, or go get water, or stretch, or just veg out for a few minutes.

# Pandas

Pandas is a Python library for storing and working with tabular data.  

### If we already have `np.array()`, why do we need pandas?
- pandas supports non numeric data (strings for categorical data, for example)
- pandas supports reading / storing data from more formats
    - csv (spreadsheets)
- pandas more elegantly deals with missing data
- pandas handles indexing woes

You could do almost everything pandas does with numpy arrays ... but it would be much more difficult to accomplish.

### Pandas has two essential objects:
- **dataframe**
    - 2 dimensional data structure
    - you've already seen one today!  (we replicate below)
- **series (vectors)**
    - 1 dimensional data structure, each item associated with some index
    - you could store the weight of all the penguins as a series 
        - (all samples of one feature)
    - you could store the weight, bill size, sex, island, etc for a single penguin as a series
        - (all features for one sample)

In [68]:
import seaborn as sns

# df stands for dataframe.  df_penguin is a dataframe of penguin data
df_penguin = sns.load_dataset('penguins')
df_penguin.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


In [69]:
# the table above is a dataframe
type(df_penguin)

pandas.core.frame.DataFrame

## Pandas Series
### building:
- building: default index
- building: custom index
- building: from a dict

In [70]:
# each row, or column is a series object
# this represents first row of dataframe
penguin0_series = df_penguin.iloc[0, :]
penguin0_series

species                 Adelie
island               Torgersen
bill_length_mm            39.1
bill_depth_mm             18.7
flipper_length_mm        181.0
body_mass_g             3750.0
sex                       Male
Name: 0, dtype: object

Pandas series contain a sequence of labelled data elements:
- penguin0's `species` is `Adelie`
- penguin0's `island` is `Torgersen`
- penguin0's `bill_length_mm` is `39.1` ...
- penguin0's `<index-name>` is `<corresponding-value>`

A series is quite similar to a dictionary ...

In [71]:
penguin0_dict = {'species': 'Adelie',
 'island': 'Torgersen',
 'bill_length_mm': 39.1,
 'bill_depth_mm': 18.7,
 'flipper_length_mm': 181.0,
 'body_mass_g': 3750.0,
 'sex': 'Male'}

In [72]:
import pandas as pd

# build a series from dict
penguin0_series = pd.Series(penguin0_dict)
penguin0_series

species                 Adelie
island               Torgersen
bill_length_mm            39.1
bill_depth_mm             18.7
flipper_length_mm        181.0
body_mass_g             3750.0
sex                       Male
dtype: object

In [73]:
# you can also pass two corresponding lists / tuples
index = ['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex']
values = ['Adelie', 'Torgersen', 39.1, 18.7, 181.0, 3750.0, 'Male']

penguin0_series = pd.Series(values, index=index)
penguin0_series

species                 Adelie
island               Torgersen
bill_length_mm            39.1
bill_depth_mm             18.7
flipper_length_mm        181.0
body_mass_g             3750.0
sex                       Male
dtype: object

In [74]:
# sometimes your data has no meaningful index
# pandas will default to indexing things with integers
ice_cream_flavors = 'vanilla', 'chocolate', 'cherry garcia', 'oatmeal'
pd.Series(ice_cream_flavors)

0          vanilla
1        chocolate
2    cherry garcia
3          oatmeal
dtype: object

In [75]:
# you can access values via .values
penguin0_series.values

array(['Adelie', 'Torgersen', 39.1, 18.7, 181.0, 3750.0, 'Male'],
      dtype=object)

In [76]:
# you can access index via .index
penguin0_series.index

Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
       'flipper_length_mm', 'body_mass_g', 'sex'],
      dtype='object')

### Using Series (hey, don't they look a little like dictionaries?)
- casting dictionaries to series
- accessing custom index:
    - by name: `series.loc[name]`
    - by position: `series.iloc[idx]`
- iterating: `.keys()`, `.items()` (much like dict; the older `.iteritems()` was removed in pandas 2.0)
- deleting an entry

In [77]:
dict_fav_num = {'eric':  17, 'qi': 7, 'lynne': 3, 'tamrat': 1}
series_fav_num = pd.Series(dict_fav_num)
series_fav_num

eric      17
qi         7
lynne      3
tamrat     1
dtype: int64

In [78]:
# lookup value associated with index='eric'
series_fav_num['eric']

17

In [79]:
# another way to lookup value associated with index='eric'
series_fav_num.loc['eric']

17

In [80]:
# return the value in position 1 (the second position, from top)
series_fav_num.iloc[1]

7

In [81]:
# each of these access methods can also set the value
series_fav_num.iloc[2] = 1000
series_fav_num

eric        17
qi           7
lynne     1000
tamrat       1
dtype: int64

In [82]:
for idx in series_fav_num.index:
    print(idx)

eric
qi
lynne
tamrat


In [83]:
# check membership of item in index
'eric' in series_fav_num.index

True

In [84]:
'alice' in series_fav_num.index

False

In [85]:
# iterating through values
for val in series_fav_num.values:
    print(val)

17
7
1000
1


In [86]:
# iterating through index (just like a dict!)
# .keys() produces same thing as .index
for key in series_fav_num.keys():
    print(key)

eric
qi
lynne
tamrat


In [87]:
# iterating through index, value pairs (just like dict!)
for key, val in series_fav_num.items():
    print(key, val)

eric 17
qi 7
lynne 1000
tamrat 1


In [88]:
# removing a pair by its corresponding index (just like dict!)
del series_fav_num['eric']

In [89]:
series_fav_num

qi           7
lynne     1000
tamrat       1
dtype: int64

### Describing a `pd.Series`

Just like numpy arrays:
- `Series.argmin()`
    - which index has smallest value
    - pandas gives the row number, not the index
- `Series.argmax()`
    - which index has largest value
    - pandas gives the row number, not the index
- `Series.mean()`
- `Series.min()`
- `Series.max()`
- `Series.std()`
- `Series.var()`

New to pandas:
- `Series.count()`
    - number of non-missing values in the series
- `Series.describe()`
    - summary statistics

In [90]:
# number of entries (rows)
series_fav_num.count()

3

In [91]:
# other descriptors/summary statistics
series_fav_num.describe()

count       3.000000
mean      336.000000
std       575.048694
min         1.000000
25%         4.000000
50%         7.000000
75%       503.500000
max      1000.000000
dtype: float64
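If you want the index label rather than the row number, pandas also provides `Series.idxmin()` and `Series.idxmax()`. A sketch using the same favorite numbers:

```python
import pandas as pd

series_fav_num = pd.Series({'qi': 7, 'lynne': 1000, 'tamrat': 1})

# argmin gives the row number of the smallest value
print(series_fav_num.argmin())   # 2

# idxmin gives the index label of the smallest value
print(series_fav_num.idxmin())   # tamrat

# idxmax gives the index label of the largest value
print(series_fav_num.idxmax())   # lynne
```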

## Pandas: DataFrame

Remember:
- `Series`:  1d data object
- `DataFrame`: 2d data object

`DataFrame`s represent two-dimensional data, for example, grades:

|           | Quiz 0 | Quiz 1 | Quiz 2 |
|-----------|--------|--------|--------|
| Student 0 | 80     | 90     | 50     |
| Student 1 | 87     | 92     | 80     |

Each column or row above could be considered a `Series` object (as we'll see later, we can indeed extract a single row or column of a dataframe as a `Series` object).

In [2]:
import pandas as pd
import numpy as np

quiz_array = np.array([[80, 90, 50],
                 [87, 92, 80]])

df_quiz = pd.DataFrame(quiz_array, 
                       columns=('quiz0', 'quiz1', 'quiz2'), 
                       index=('student0', 'student1'))
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80


In [3]:
# we can also construct a dataframe from a dictionary
# keys of the dictionary are columns of dataframe
# values are lists (or tuples) of the values in each column
quiz_dict = {'quiz0': [80, 87],
            'quiz1': [90, 92],
            'quiz2': [50, 80]}
pd.DataFrame(quiz_dict, index=('student0', 'student1'))

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80


In [94]:
# another way to construct, this time the transpose
quiz_dict2 = {'student0': [80, 90, 50],
             'student1': [87, 92, 80]}
pd.DataFrame(quiz_dict2, index=('quiz0', 'quiz1', 'quiz2'))

Unnamed: 0,student0,student1
quiz0,80,87
quiz1,90,92
quiz2,50,80


In [95]:
# could also use .transpose() or even .T
df_quiz.transpose()
# df_quiz.T

Unnamed: 0,student0,student1
quiz0,80,87
quiz1,90,92
quiz2,50,80


In [96]:
# we can also add the column or index names after creation
df_quiz = pd.DataFrame(quiz_array)
df_quiz

Unnamed: 0,0,1,2
0,80,90,50
1,87,92,80


In [97]:
df_quiz.columns = ['quiz0', 'quiz1', 'quiz2']
df_quiz.index = ('student0', 'student1')
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80


### Describing a `pd.DataFrame`

Similar to Series (for now), but with a couple differences:
- `DataFrame.iloc[].argmin()` or `DataFrame.loc[].argmin()` 
    - note that this does not work on the DataFrame itself, but on specified series/rows
    - which index has smallest value
    - pandas gives the row number, not the index
- `DataFrame.iloc[].argmax()` or `DataFrame.loc[].argmax()`
    - note that this does not work on the DataFrame itself, but on specified series/rows
    - which index has largest value
    - pandas gives the row number, not the index
- `DataFrame.mean()`
- `DataFrame.min()`
- `DataFrame.max()`
- `DataFrame.std()`
- `DataFrame.var()`

New to pandas:
- `DataFrame.count()`
    - number of non-missing values in each column
- `DataFrame.describe()`
    - summary statistics

In [98]:
# by default, each method applies operation to entire column of data
# note that there may be some complaining from python if not all the columns are numeric
df_quiz.mean()

quiz0    83.5
quiz1    91.0
quiz2    65.0
dtype: float64

In [99]:
# we can also pass axis parameter to specify if operation should be applied to row or column
# !remember!
# axis=0 -> apply operation across all rows (returns operation per col)
# axis=1 -> apply operation across all cols (returns operation per row)
df_quiz.mean(axis=1)

student0    73.333333
student1    86.333333
dtype: float64

In [100]:
# describe only works on columns
df_quiz.describe()

Unnamed: 0,quiz0,quiz1,quiz2
count,2.0,2.0,2.0
mean,83.5,91.0,65.0
std,4.949747,1.414214,21.213203
min,80.0,90.0,50.0
25%,81.75,90.5,57.5
50%,83.5,91.0,65.0
75%,85.25,91.5,72.5
max,87.0,92.0,80.0
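To apply `.argmin()` / `.argmax()` from the list above, first pull out a single row or column as a `Series`. A sketch with the quiz data:

```python
import pandas as pd

df_quiz = pd.DataFrame({'quiz0': [80, 87],
                        'quiz1': [90, 92],
                        'quiz2': [50, 80]},
                       index=('student0', 'student1'))

# pull out student0's row as a Series
student0 = df_quiz.loc['student0', :]

# position of student0's lowest and highest quiz
print(student0.argmin())   # 2
print(student0.argmax())   # 1

# the highest quiz0 score is in row number 1 (student1)
print(df_quiz.loc[:, 'quiz0'].argmax())   # 1
```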


## Indexing / Accessing a DataFrame
- indexing: 
    - `.loc[]` indexing by name of row or column
    - `.iloc[]` indexing by position integer (0, 1, 2, 3, 4 ...)
    - slicing & subsets
- using `:` to get full rows or columns
- single cell's contents: `at`, `iat`

In [101]:
# indexing data by "name"
# remember: rows first, then columns ... 
# 1st entry describes which row ('student0')
# 2nd entry describes which col ('quiz0')

df_quiz.loc['student0', 'quiz0']

80

In [102]:
# index data by position
# 1st entry describes which row.  0 -> the 1st (topmost) row
# 2nd entry describes which col.  2 -> the 3rd (from the left) col
df_quiz.iloc[0, 2]

50

In [103]:
# you can use same slicing syntaxes on both .loc and .iloc
# 1st row, last col
df_quiz.iloc[0, -1]

50

In [104]:
# all rows, only the second col
df_quiz.iloc[:, 1]

student0    90
student1    92
Name: quiz1, dtype: int32

In [105]:
# all rows, only quiz0
df_quiz.loc[:, 'quiz0']

student0    80
student1    87
Name: quiz0, dtype: int32

In [106]:
# slicing with named cols and rows
# you can get a range, by name of row/col
# note: this includes both start and stop columns
df_quiz.loc['student0', 'quiz0':'quiz2' ]

quiz0    80
quiz1    90
quiz2    50
Name: student0, dtype: int32

In [107]:
# watch out:
# when you get ranges indexed by position: include start idx, exclude stop idx
df_quiz.iloc[0, 0:2 ]

quiz0    80
quiz1    90
Name: student0, dtype: int32

In [108]:
# if you access directly into dataframe, it will assume you're looking for a column
# (below is equivalent to df_quiz.loc[:, 'quiz0'])
df_quiz['quiz0']

student0    80
student1    87
Name: quiz0, dtype: int32
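For a single cell, `.at[]` (by name) and `.iat[]` (by position) are the single-value counterparts of `.loc[]` and `.iloc[]`. A sketch on a separate copy of the quiz data (`df_demo` is an illustrative name, used so `df_quiz` is left alone):

```python
import pandas as pd

df_demo = pd.DataFrame({'quiz0': [80, 87],
                        'quiz1': [90, 92],
                        'quiz2': [50, 80]},
                       index=('student0', 'student1'))

# by name: row 'student1', column 'quiz2'
print(df_demo.at['student1', 'quiz2'])   # 80

# by position: row 0, column 1
print(df_demo.iat[0, 1])                 # 90

# both can also set a value
df_demo.at['student0', 'quiz0'] = 85
print(df_demo.loc['student0', 'quiz0'])  # 85
```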

## Modifying a DataFrame
- updating values: single cell
- adding a new column
- `pd.concat()`
    - adds rows (for example, a single row) to a dataframe
    - replaces `pd.DataFrame.append()`, which was deprecated and then removed in pandas 2.0

In [4]:
# setting single entry in dataframe
df_quiz.loc['student0', 'quiz1'] = 123
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,123,50
student1,87,92,80


In [5]:
# adding a new column (which student got which grade?)
# notice data frames can include columns of multiple types!
df_quiz['overall grade'] = 'a', 'b' 
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2,overall grade
student0,80,123,50,a
student1,87,92,80,b


In [6]:
# .copy() makes an independent copy of a dataframe, so changes to the copy do not affect the original df
df_test = df_quiz.copy()
df_test

Unnamed: 0,quiz0,quiz1,quiz2,overall grade
student0,80,123,50,a
student1,87,92,80,b


In [111]:
# delete a column
del df_quiz['overall grade']
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,123,50
student1,87,92,80


In [112]:
# adding a column (next 2 cells): a more error-robust way of handling indexing
# by explicitly labelling the index, we make sure each value lands in the matching row
s_overgrade = pd.Series({'student1': 'b-',
                         'student0': 'a+',
                        'student2': 'f (no quizzes taken)'})
s_overgrade

student1                      b-
student0                      a+
student2    f (no quizzes taken)
dtype: object

In [113]:
# notice how pandas helps us out in aligning our new column with proper row
# (and avoids including student2)
df_quiz['overall grade'] = s_overgrade
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2,overall grade
student0,80,123,50,a+
student1,87,92,80,b-
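Alignment also works in the other direction: if the series is missing a label that the dataframe has, that row gets a missing value. A small sketch, assuming a hypothetical series `s_partial` that only has a grade for one student:

```python
import pandas as pd

df_quiz = pd.DataFrame({'quiz0': [80, 87]},
                       index=['student0', 'student1'])

# hypothetical series with a grade for student1 only
s_partial = pd.Series({'student1': 'b-'})

# rows are matched by label; student0 has no match
df_quiz['overall grade'] = s_partial

# student0 ends up with NaN (missing) in the new column
print(df_quiz['overall grade'].isna())
```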


In [114]:
# how to 'drop' a row (returns a dataframe with row removed)
df_quiz_short = df_quiz.drop('student0')
df_quiz_short

Unnamed: 0,quiz0,quiz1,quiz2,overall grade
student1,87,92,80,b-
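`.drop()` can remove columns as well as rows. A short sketch using the `index=` and `columns=` keywords, which spell out which axis is meant (the result names are illustrative):

```python
import pandas as pd

df_quiz = pd.DataFrame({'quiz0': [80, 87],
                        'quiz1': [90, 92],
                        'quiz2': [50, 80]},
                       index=['student0', 'student1'])

# drop a row by its label
df_no_student0 = df_quiz.drop(index='student0')

# drop a column by its name
df_no_quiz2 = df_quiz.drop(columns='quiz2')

# like before, the original dataframe is left untouched
print(df_quiz.shape, df_no_student0.shape, df_no_quiz2.shape)
```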


In [115]:
# rebuild df_quiz
quiz_dict = {'quiz0': [80, 87],
            'quiz1': [90, 92],
            'quiz2': [50, 80]}
df_quiz = pd.DataFrame(quiz_dict, index=('student0', 'student1'))
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80


In [116]:
# notice: name of series ends up as the row label in the index of the dataframe
# notice: order of items in series doesn't matter, they're aligned by index
s_student3 = pd.Series({'quiz1': 90,
                        'quiz2': 100,
                        'quiz0': 95},
                      name='student3')
s_student3

quiz1     90
quiz2    100
quiz0     95
Name: student3, dtype: int64

In [117]:
# add new row to dataframe
# doesn't modify original df
pd.concat([df_quiz, s_student3.to_frame().T])

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80
student3,95,90,100


In [118]:
# also notice: pd.concat() returns a new dataframe, so df_quiz wasn't modified above
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80


In [119]:
# thus, must overwrite:
df_quiz = pd.concat([df_quiz, s_student3.to_frame().T])
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80
student3,95,90,100


In [7]:
# adding a column that is a function of other columns:
df_quiz['average'] = round((df_quiz.quiz0 + df_quiz.quiz1 + df_quiz.quiz2)/3, 3)
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2,overall grade,average
student0,80,123,50,a,84.333
student1,87,92,80,b,86.333
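Adding the columns by hand gets tedious with many quizzes. One alternative, sketched here under the assumption that the quiz columns are listed in a (hypothetical) `quiz_cols` list, is to take the row-wise mean with `.mean(axis=1)`:

```python
import pandas as pd

df_quiz = pd.DataFrame({'quiz0': [80, 87],
                        'quiz1': [90, 92],
                        'quiz2': [50, 80]},
                       index=['student0', 'student1'])

# columns to average over (an assumption for this example)
quiz_cols = ['quiz0', 'quiz1', 'quiz2']

# axis=1 averages across the columns, giving one value per row
df_quiz['average'] = df_quiz[quiz_cols].mean(axis=1).round(3)
print(df_quiz)
```

Selecting `quiz_cols` first keeps any non-quiz columns out of the average.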


### Operating on DataFrame & Series Objects

Arithmetic operators apply element-wise, just as you'd expect:

In [121]:
df_quiz * 1000

Unnamed: 0,quiz0,quiz1,quiz2,average
student0,80000,90000,50000,73333.333333
student1,87000,92000,80000,86333.333333
student3,95000,90000,100000,95000.0


In [122]:
# we can also use comparison operators (super helpful, see boolean indexing next)
df_quiz >= 85

Unnamed: 0,quiz0,quiz1,quiz2,average
student0,False,True,False,False
student1,True,True,False,True
student3,True,True,True,True
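Because `True` counts as 1 and `False` as 0, summing a boolean dataframe is a quick way to count how many entries meet a condition. A minimal sketch on the quiz columns only (`passed` is an illustrative name):

```python
import pandas as pd

df_quiz = pd.DataFrame({'quiz0': [80, 87, 95],
                        'quiz1': [90, 92, 90],
                        'quiz2': [50, 80, 100]},
                       index=['student0', 'student1', 'student3'])

passed = df_quiz >= 85

# how many students scored at least 85 on each quiz (sum down each column)
per_quiz = passed.sum()

# how many quizzes each student scored at least 85 on (sum across each row)
per_student = passed.sum(axis=1)

print(per_quiz)
print(per_student)
```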


### Boolean Indexing into DataFrame

Sometimes we want to grab only the rows or columns which meet a particular condition.

"Get all students whose grade was higher than 85 on quiz 1"

In [8]:
quiz_dict = {'quiz0': [80, 87, 60, 30],
            'quiz1': [90, 92, 60, 23],
            'quiz2': [50, 80, 70, 64]}
df_quiz = pd.DataFrame(quiz_dict, index=('student0', 'student1', 'student2', 'student3'))
df_quiz

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student1,87,92,80
student2,60,60,70
student3,30,23,64


In [10]:
# quiz 2 is a series object which contains every index's quiz 2 grade
# one, verbose way:
s_quiz2 = df_quiz.loc[:, 'quiz2']
s_quiz2

student0    50
student1    80
student2    70
student3    64
Name: quiz2, dtype: int64

In [125]:
# easier way, but not as convenient for seeing what we're about to do
df_quiz.quiz2

student0    50
student1    80
student2    70
student3    64
Name: quiz2, dtype: int64

In [12]:
# boolean indexing: using a boolean series as index returns only those entries which are True
# notice that since student1 & student2's quiz2 grade wasn't < 70 they aren't included below
df_quiz.loc[s_quiz2 < 70, :]

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
student3,30,23,64


In [13]:
# we can build more complex conditions using 
# & (and operator)
# | (or operator)

# all students who scored more than 80 on quiz1 and less than 75 on quiz2
s_bool = (df_quiz.loc[:, 'quiz1'] > 80) & (df_quiz.loc[:, 'quiz2']  < 75)
s_bool

student0     True
student1    False
student2    False
student3    False
dtype: bool

In [128]:
df_quiz.loc[s_bool, :]

Unnamed: 0,quiz0,quiz1,quiz2
student0,80,90,50
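The same pattern works with `|` (or) and `~` (not). A short sketch on the same four students; the thresholds are arbitrary and chosen only for illustration. The parentheses around each comparison are needed because `&` and `|` bind more tightly than `<` and `>`.

```python
import pandas as pd

quiz_dict = {'quiz0': [80, 87, 60, 30],
             'quiz1': [90, 92, 60, 23],
             'quiz2': [50, 80, 70, 64]}
df_quiz = pd.DataFrame(quiz_dict,
                       index=('student0', 'student1', 'student2', 'student3'))

# students who scored < 50 on quiz0 OR < 60 on quiz2
s_either = (df_quiz['quiz0'] < 50) | (df_quiz['quiz2'] < 60)
df_either = df_quiz.loc[s_either, :]

# ~ flips every True/False, giving everyone else
df_neither = df_quiz.loc[~s_either, :]

print(df_either)
print(df_neither)
```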


# Loading Data into Pandas

Data comes from many places:
- Web Scraping
- Application Program Interface (API)
- SQL
- local file:
    - csv
    - JSON
    - fixed width tables (HTML)
    
### Pandas functions which load data
| Function | Description
| ------ | :------
| **`read_csv`** | Load comma-separated values from a file or URL (other delimiters too!)
| **`read_excel`** | Read data from a Microsoft Excel file (xls/xlsx)
| **`read_fwf`** | Read data in fixed-width column format (i.e., columns aligned by position, with no delimiters)
| **`read_clipboard`** | Version of read_csv that reads data from the clipboard; useful for converting tables from web pages
| **`read_html`** | Read all tables contained in the given HTML document
| **`read_json`** | Read data from a JSON (JavaScript Object Notation) string representation
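As a quick illustration of the "other delimiters" point, here is a sketch that reads semicolon-separated text with `read_csv` and the `sep` argument. The text is made up and wrapped in `io.StringIO` so the example runs without any file on disk.

```python
import io

import pandas as pd

# made-up semicolon-separated data standing in for a file
raw_text = "name;quiz0;quiz1\nstudent0;80;90\nstudent1;87;92\n"

# sep tells read_csv which character separates the values
df_semicolon = pd.read_csv(io.StringIO(raw_text), sep=';', index_col='name')
print(df_semicolon)
```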

## Reading CSV into Pandas
- read_csv
- index_col
- header

In [15]:
# note: a bare filename like this only works if the file is in the same folder as the jupyter notebook
pd.read_csv('cleaner_gtky.csv')

Unnamed: 0,fake_student_id,time_stamp,class,co_op,dream_career,hobby,song,ai_feels,age_months,pets,credit_hours,work_hours
0,1841.0,2023/09/06 12:59:46 AM AST,Junior,No,Pharmaceutical data scientist job,"playing poker, try new restaurants",,"I think it's the future, and the future looks ...",240.0,0.0,18,30.0
1,1049.0,2023/09/06 5:52:01 AM AST,Senior,Yes,Management,Drawing,https://www.youtube.com/watch?v=A1EhBdsTkl8,"I think it's the future, and the future looks ...",259.0,8.0,16,0.0
2,1508.0,2023/09/06 8:40:59 AM AST,Sophomore,No,I want to be a UI/UX developer.,I love doing yoga and going on long walks!,"""I'll Be There For You""","It will be useful, but it won't change the wor...",230.0,0.0,19,15.0
3,1717.0,2023/09/06 9:07:26 AM AST,Junior,Yes,International Affairs,Singing,These are the Days by Inhaler - https://www.yo...,I'm a bit worried it might be used for ill.,246.0,5.0,17,0.0
4,1745.0,2023/09/06 10:04:32 AM AST,Sophomore,No,research in data science & neuroscience fields,"travelling, reading, painting",,I'm a bit worried it might be used for ill.,230.0,6.0,17,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
159,1875.0,2023/09/09 2:34:29 PM AST,Sophomore,No,Software engineering,Chess,gettin jiggy with it,"It will be useful, but it won't change the wor...",229.0,0.0,19,0.0
160,1383.0,2023/09/09 3:03:11 PM AST,Sophomore,No,Working in the family company in the supply ch...,Golf,The intro song of the Peaky Blinders,"It will be useful, but it won't change the wor...",259.0,5.0,16,0.0
161,1522.0,2023/09/09 4:08:28 PM AST,Junior,Yes,Considering careers/in between. Looking to tra...,Breathing probably,break it out https://www.youtube.com/watch?v=v...,It's going to destroy us all!,251.0,2.0,16,6.0
162,1906.0,2023/09/09 5:37:01 PM AST,Sophomore,No,A more research-oriented path,Reading anything I find interesting,https://m.youtube.com/watch?v=dQw4w9WgXcQ&pp=y...,"I think it's the future, and the future looks ...",232.0,0.0,19,0.0


In [16]:
# how to specify index col (make sure its values uniquely identify each row!)
pd.read_csv('cleaner_gtky.csv', index_col='fake_student_id')

Unnamed: 0_level_0,time_stamp,class,co_op,dream_career,hobby,song,ai_feels,age_months,pets,credit_hours,work_hours
fake_student_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1841.0,2023/09/06 12:59:46 AM AST,Junior,No,Pharmaceutical data scientist job,"playing poker, try new restaurants",,"I think it's the future, and the future looks ...",240.0,0.0,18,30.0
1049.0,2023/09/06 5:52:01 AM AST,Senior,Yes,Management,Drawing,https://www.youtube.com/watch?v=A1EhBdsTkl8,"I think it's the future, and the future looks ...",259.0,8.0,16,0.0
1508.0,2023/09/06 8:40:59 AM AST,Sophomore,No,I want to be a UI/UX developer.,I love doing yoga and going on long walks!,"""I'll Be There For You""","It will be useful, but it won't change the wor...",230.0,0.0,19,15.0
1717.0,2023/09/06 9:07:26 AM AST,Junior,Yes,International Affairs,Singing,These are the Days by Inhaler - https://www.yo...,I'm a bit worried it might be used for ill.,246.0,5.0,17,0.0
1745.0,2023/09/06 10:04:32 AM AST,Sophomore,No,research in data science & neuroscience fields,"travelling, reading, painting",,I'm a bit worried it might be used for ill.,230.0,6.0,17,0.0
...,...,...,...,...,...,...,...,...,...,...,...
1875.0,2023/09/09 2:34:29 PM AST,Sophomore,No,Software engineering,Chess,gettin jiggy with it,"It will be useful, but it won't change the wor...",229.0,0.0,19,0.0
1383.0,2023/09/09 3:03:11 PM AST,Sophomore,No,Working in the family company in the supply ch...,Golf,The intro song of the Peaky Blinders,"It will be useful, but it won't change the wor...",259.0,5.0,16,0.0
1522.0,2023/09/09 4:08:28 PM AST,Junior,Yes,Considering careers/in between. Looking to tra...,Breathing probably,break it out https://www.youtube.com/watch?v=v...,It's going to destroy us all!,251.0,2.0,16,6.0
1906.0,2023/09/09 5:37:01 PM AST,Sophomore,No,A more research-oriented path,Reading anything I find interesting,https://m.youtube.com/watch?v=dQw4w9WgXcQ&pp=y...,"I think it's the future, and the future looks ...",232.0,0.0,19,0.0
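pandas does not stop you from using a column with repeated values as the index, so it can be worth checking. A minimal sketch with made-up data (not the real survey file) in which one id appears twice:

```python
import io

import pandas as pd

# made-up data: the id 1049 appears on two rows
raw_text = "fake_student_id,pets\n1841,0\n1049,8\n1049,2\n"
df_ids = pd.read_csv(io.StringIO(raw_text), index_col='fake_student_id')

# False here means some index labels are repeated
print(df_ids.index.is_unique)

# looking up a repeated label returns every matching row
print(df_ids.loc[1049])
```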


In [17]:
# look at the first few rows (head)
gtky = pd.read_csv('cleaner_gtky.csv', index_col='fake_student_id')
gtky.head()

Unnamed: 0_level_0,time_stamp,class,co_op,dream_career,hobby,song,ai_feels,age_months,pets,credit_hours,work_hours
fake_student_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1841.0,2023/09/06 12:59:46 AM AST,Junior,No,Pharmaceutical data scientist job,"playing poker, try new restaurants",,"I think it's the future, and the future looks ...",240.0,0.0,18,30.0
1049.0,2023/09/06 5:52:01 AM AST,Senior,Yes,Management,Drawing,https://www.youtube.com/watch?v=A1EhBdsTkl8,"I think it's the future, and the future looks ...",259.0,8.0,16,0.0
1508.0,2023/09/06 8:40:59 AM AST,Sophomore,No,I want to be a UI/UX developer.,I love doing yoga and going on long walks!,"""I'll Be There For You""","It will be useful, but it won't change the wor...",230.0,0.0,19,15.0
1717.0,2023/09/06 9:07:26 AM AST,Junior,Yes,International Affairs,Singing,These are the Days by Inhaler - https://www.yo...,I'm a bit worried it might be used for ill.,246.0,5.0,17,0.0
1745.0,2023/09/06 10:04:32 AM AST,Sophomore,No,research in data science & neuroscience fields,"travelling, reading, painting",,I'm a bit worried it might be used for ill.,230.0,6.0,17,0.0


In [18]:
# or the last few (tail)
gtky.tail()

Unnamed: 0_level_0,time_stamp,class,co_op,dream_career,hobby,song,ai_feels,age_months,pets,credit_hours,work_hours
fake_student_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1875.0,2023/09/09 2:34:29 PM AST,Sophomore,No,Software engineering,Chess,gettin jiggy with it,"It will be useful, but it won't change the wor...",229.0,0.0,19,0.0
1383.0,2023/09/09 3:03:11 PM AST,Sophomore,No,Working in the family company in the supply ch...,Golf,The intro song of the Peaky Blinders,"It will be useful, but it won't change the wor...",259.0,5.0,16,0.0
1522.0,2023/09/09 4:08:28 PM AST,Junior,Yes,Considering careers/in between. Looking to tra...,Breathing probably,break it out https://www.youtube.com/watch?v=v...,It's going to destroy us all!,251.0,2.0,16,6.0
1906.0,2023/09/09 5:37:01 PM AST,Sophomore,No,A more research-oriented path,Reading anything I find interesting,https://m.youtube.com/watch?v=dQw4w9WgXcQ&pp=y...,"I think it's the future, and the future looks ...",232.0,0.0,19,0.0
1097.0,2023/09/10 10:41:25 AM AST,Junior,Yes,biotech researcher,"playing/writing music, though my current inter...",,I'm a bit worried it might be used for ill.,246.0,2.0,18,8.0


In [19]:
# look at stats for the different classes
class_gtky = gtky.groupby("class")
class_gtky.pets.describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
class,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Faculty,1.0,6.0,,6.0,6.0,6.0,6.0,6.0
Junior,40.0,2.6875,2.87493,0.0,1.0,2.0,3.25,13.0
Senior,23.0,2.565217,3.272837,0.0,0.0,2.0,3.0,12.0
Sophomore,99.0,2.626263,2.48089,0.0,1.0,2.0,4.0,10.0
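If `.describe()` gives more than you need, `.agg()` lets you pick the summaries yourself. A sketch on a tiny made-up dataframe (not the real survey data) with the same `class` and `pets` column names:

```python
import pandas as pd

# tiny made-up stand-in for the survey data
df_small = pd.DataFrame({'class': ['Junior', 'Junior', 'Senior',
                                   'Sophomore', 'Sophomore'],
                         'pets': [1, 3, 2, 0, 4]})

# one row per class, one column per requested summary
pets_by_class = df_small.groupby('class')['pets'].agg(['count', 'mean', 'max'])
print(pets_by_class)
```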


## Saving a DataFrame as a csv
- .to_csv()
- index=False
- header=False
- appending to csv (mode='a', header=False)

In [20]:
# doesn't save index into first column of csv
gtky.to_csv('gtky_copy.csv', index=False)
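To see what `index=False` actually changes, here is a small sketch with a made-up dataframe. When `.to_csv()` is called without a filename it returns the csv text as a string, which makes the two versions easy to compare.

```python
import pandas as pd

# made-up two-row dataframe with a named index
df_pets = pd.DataFrame({'pets': [0, 8]},
                       index=pd.Index([1841, 1049], name='fake_student_id'))

# no filename given, so the csv text is returned as a string
csv_with_index = df_pets.to_csv()
csv_without_index = df_pets.to_csv(index=False)

print(csv_with_index)      # first column holds the index
print(csv_without_index)   # index is left out entirely
```

Keep in mind that with `index=False` the index values are not written at all, so only use it when the index carries no information you need.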

In [135]:
# doesn't save header into first row of csv
gtky.to_csv('gtky_copy_no_head.csv', header=False)

In [136]:
# why would you want to not save the header?
# you could append to an existing csv with mode='a'
# (pass index=False on every write so the appended rows line up with the header)

gtky.to_csv('gtky_copy2.csv', index=False)
for _ in range(10):
    gtky.to_csv('gtky_copy2.csv', index=False, header=False, mode='a')

## What's next?

The example above seems a bit contrived, but imagine you have a web-scraping job which goes to some web page every hour and scrapes it to get some new data.  You could just add the new data as a few new rows to your existing dataset with the syntax shown above.
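A rough sketch of that scraping scenario, with two made-up batches standing in for data collected an hour apart. The column names and the `scraped.csv` filename are assumptions for illustration, and the file is written to a temporary folder so nothing in your working directory is touched.

```python
import os
import tempfile

import pandas as pd

# made-up batches standing in for two scraping runs
first_batch = pd.DataFrame({'page': ['a', 'b'], 'views': [10, 20]})
second_batch = pd.DataFrame({'page': ['c'], 'views': [30]})

csv_path = os.path.join(tempfile.mkdtemp(), 'scraped.csv')

# first run: create the file, including the header row
first_batch.to_csv(csv_path, index=False)

# later runs: append the new rows only, without repeating the header
second_batch.to_csv(csv_path, mode='a', header=False, index=False)

combined = pd.read_csv(csv_path)
print(combined)
```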

Next Topics (for Homework 2):
- Getting Data from the internet (APIs and Web Scraping)
- Cleaning Data
- Summarizing data with numerical summaries and plots