# Section 2: Dataframes

## Introduction
One of the most common forms of "data" is *tabular data*. If you are trying to operate on more complex data (e.g. building image recognition software or a music tagging program), you have to start using more nuanced approaches.

However, sometimes the most mundane arrangements of data provide the most insight!

In this lesson, we will use the *Pandas* Python library to manipulate tabular data.

[Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

## CSV Files with Pandas

Load the `data.csv` file using pandas and view the data.
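
As a minimal sketch of what loading and viewing can look like: the snippet below reads a small in-memory CSV so that it runs without the course file. The column names and values are hypothetical stand-ins, not the actual contents of `data.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv; the real file's columns may differ.
csv_text = """student,hw1,hw2,hw3,test1
ana,90,85,88,93
ben,72,80,79,81
cam,95,97,94,96
"""

# pd.read_csv accepts a path such as "data.csv" or any file-like object.
grades = pd.read_csv(io.StringIO(csv_text))

print(grades.head())   # the first rows (up to five)
print(grades.shape)    # (number of rows, number of columns)
```

With the real file, the call would presumably be `pd.read_csv("data.csv")`, assuming the file sits next to the notebook.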

In [1]:
# Follow along...

## Subsetting Data
### Subsetting by Index
The `.iloc` indexer selects rows and columns by their integer position, using square brackets rather than a function call.
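
A short sketch of positional selection with `.iloc`, using a made-up gradebook (the column names here are illustrative assumptions, not the columns of `data.csv`):

```python
import pandas as pd

# Hypothetical gradebook used only for illustration.
grades = pd.DataFrame({
    "student": ["ana", "ben", "cam", "dee"],
    "hw1": [90, 72, 95, 60],
    "hw2": [85, 80, 97, 70],
    "test1": [93, 81, 96, 65],
})

first_two_rows = grades.iloc[:2]   # rows at positions 0 and 1, all columns
single_value = grades.iloc[0, 1]   # row 0, column 1 (hw1 for the first student)
block = grades.iloc[1:3, 1:3]      # rows 1-2 and columns 1-2; the end is exclusive
last_row = grades.iloc[-1]         # negative positions count from the end

print(block)
```

Positions follow the usual Python slicing rules, so `1:3` covers positions 1 and 2 only.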

In [2]:
# Follow along...

In [3]:
# ON YOUR OWN: Using numerical subsetting...
# Get the first three homework grades of the last three rows of data.


### Subsetting by Name and Chaining

Numerical subsetting is useful, but most of the time your data columns should be clearly labeled, so you can select them by name instead.
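
A minimal sketch of selecting by name and chaining selections, again on a hypothetical gradebook whose column names are assumptions made for this example:

```python
import pandas as pd

# Hypothetical gradebook used only for illustration.
grades = pd.DataFrame({
    "student": ["ana", "ben", "cam", "dee"],
    "hw1": [90, 72, 95, 60],
    "hw2": [85, 80, 97, 70],
    "test1": [93, 81, 96, 65],
})

hw1 = grades["hw1"]                  # one name gives a Series
homework = grades[["hw1", "hw2"]]    # a list of names gives a DataFrame

# .loc selects by label; unlike .iloc, the slice 0:1 includes label 1.
by_label = grades.loc[0:1, ["student", "test1"]]

# Chaining: pick columns by name, then keep the first two rows.
chained = grades[["hw1", "hw2"]].head(2)

print(chained)
```

Each step in a chain returns a new object, which is why another selection can be applied to it directly.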

In [4]:
# Follow along

In [5]:
# ON YOUR OWN: Much like the previous on your own...
# Get the first three homework grades of the last three rows of data, 
# but do it in ONE LINE using column name subsetting and chaining.

# PUT SOLUTION HERE

## Conditional Selection

Alright, now it's time to get into the interesting stuff.
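
Conditional selection usually works by building a boolean mask (one `True`/`False` per row) and passing it inside square brackets. The sketch below assumes a hypothetical `used_hints` column; the real dataset may name or encode it differently.

```python
import pandas as pd

# Hypothetical data; column names are assumptions for this example.
grades = pd.DataFrame({
    "student": ["ana", "ben", "cam", "dee"],
    "used_hints": [True, False, False, True],
    "test1": [93, 81, 96, 65],
})

mask = grades["test1"] >= 90            # boolean Series, one value per row
high_scores = grades[mask]              # keep only rows where the mask is True

no_hints = grades[~grades["used_hints"]]   # "~" negates a boolean mask

# Combine conditions with & (and) or | (or); each condition needs parentheses.
both = grades[(grades["test1"] >= 90) & (~grades["used_hints"])]

print(both)
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the operators `&`, `|`, and `~` are used.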

In [6]:
# Follow Along..

## Reduction and Aggregation

In [7]:
# Follow along...
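
A reduction collapses a column (or a whole table) into a single value or a small summary. The following is a rough sketch on hypothetical data, not the course's own walkthrough:

```python
import pandas as pd

# Hypothetical gradebook used only for illustration.
grades = pd.DataFrame({
    "student": ["ana", "ben", "cam", "dee"],
    "hw1": [90, 72, 95, 60],
    "test1": [93, 81, 96, 65],
})

mean_test = grades["test1"].mean()   # average of one column
best_hw = grades["hw1"].max()        # largest value in one column

# Summing a boolean mask counts the rows where it is True.
n_passing = (grades["test1"] >= 70).sum()

# .agg applies several reductions at once, one result row per reduction.
summary = grades[["hw1", "test1"]].agg(["min", "max"])

print(mean_test, best_hw, n_passing)
print(summary)
```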

In [8]:
# ON YOUR OWN
# Out of all the students who DID NOT have hints...
# how many scored an A average (higher or equal to 93) on their tests?
# HINT: "~" is the NOT operator in pandas

In [9]:
# CHALLENGE: Can you do it in one line? 
# QUESTION: Would you want to use this one-liner in a "real" scenario (e.g. a job)? Why or why not?


## Creating New Columns

Sometimes it's useful to store the result of a transformation alongside the original data. It's easy to create new columns!
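
One way this can look, as a sketch on hypothetical data (the `hw_avg` and `passed` column names and the cutoff of 70 are invented for this example):

```python
import pandas as pd

# Hypothetical gradebook used only for illustration.
grades = pd.DataFrame({
    "student": ["ana", "ben", "cam", "dee"],
    "hw1": [90, 72, 95, 60],
    "hw2": [85, 80, 97, 70],
})

# Assigning to a new name creates the column in place.
grades["hw_avg"] = grades[["hw1", "hw2"]].mean(axis=1)

# .assign returns a new DataFrame with the extra column added.
grades = grades.assign(passed=grades["hw_avg"] >= 70)

print(grades)
```

Direct assignment modifies the existing dataframe, while `.assign` leaves the original untouched and is convenient inside a chain.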

In [None]:
# Follow along

## Complex Operations and Group Bys

In [11]:
# Follow Along...

# Group
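
A group-by splits the rows by the values of one column, applies a reduction to each group, and combines the results. The sketch below uses a hypothetical `section` column that is not known to exist in the course data:

```python
import pandas as pd

# Hypothetical roster; "section" is an assumed grouping column.
roster = pd.DataFrame({
    "section": ["A", "A", "B", "B"],
    "test1": [90, 80, 70, 90],
})

# One mean per section; the result is indexed by the group labels.
section_means = roster.groupby("section")["test1"].mean()

# Several reductions per group at once.
section_stats = roster.groupby("section")["test1"].agg(["mean", "max"])

print(section_means)
print(section_stats)
```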

## Aside: Vectorized Operations



In [None]:
import random
import time

import pandas as pd

n = 100000  # number of rows

# Two columns of uniform random floats in [0, 1)
df = pd.DataFrame(
    {"a": [random.random() for _ in range(n)],
     "b": [random.random() for _ in range(n)]},
    index=list(range(n)))

In [None]:
end_sums = []

# Baseline: add the two columns one row at a time in a Python loop
start = time.time()
for i in range(df.shape[0]):
    temp_sum = df.iloc[i, 0] + df.iloc[i, 1]
    end_sums.append(temp_sum)
end = time.time()

basic_time = end-start

print("Elapsed Time: {0}".format(basic_time))

In [None]:
# Vectorized: sum across the columns of every row in a single call
start = time.time()
summed_df = df.sum(axis=1)
end = time.time()

print(summed_df.iloc[0])

vector_time = end-start

print("Elapsed Time: {0}".format(vector_time))

In [None]:
basic_time/vector_time

If I save that dataframe to a file, it turns out to be around 5MB.

Large data analysis projects deal with files on the *gigabyte* or *terabyte* scale.

For example, [The Pile](https://pile.eleuther.ai/), a text dataset used to train large language models (LLMs) such as GPT-Neo and GPT-J, clocks in at around 800GB.