# Operating On Data Collections

There are several ways to operate on data collections. We will start with the most basic form: looping.

In [1]:
import numpy as np
import seaborn as sns

## Looping

So far we have been using the `for` loop to iterate over data collection objects.

### Lists

A list can be provided directly into the `for` loop and it will iterate over each element in the list.

In [14]:
for x in [1,2,3,4,5]:
    print(x)

1
2
3
4
5


If you want to transform the items in a list, you need to operate on each item individually and append the results to a new list.

In [3]:
new_list = []

for x in [1,2,3,4,5]:
    new_list.append(x**2)

new_list    

[1, 4, 9, 16, 25]
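
The loop-and-append pattern above can also be written as a list comprehension, which builds the new list in a single expression. This is a minimal sketch; the names `numbers` and `squared` are illustrative and do not appear elsewhere in this notebook.

```python
numbers = [1, 2, 3, 4, 5]

# Square each element and collect the results in a new list,
# equivalent to appending inside a for loop
squared = [x**2 for x in numbers]

print(squared)
```

Both forms leave the original list unchanged and produce a new one.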

### Arrays

You can loop over arrays in the same way.

In [3]:
for x in np.array([1,2,3,4,5]):
    print(x)

1
2
3
4
5


### DataFrames

In [2]:
tips = sns.load_dataset('tips')

As with lists and arrays, it is possible to loop over a DataFrame to apply an operation, but this is rarely the best option.

In [8]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


Running a `for` loop directly over a DataFrame, however, only yields the column names.

In [7]:
for x in tips:
    print(x)

total_bill
tip
sex
smoker
day
time
size
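
If the column contents are needed as well as the names, one option is the DataFrame `items()` method, which yields `(label, Series)` pairs. The sketch below assumes a small hand-built DataFrame, `sample`, as a stand-in for `tips` so that it runs without downloading the dataset; `labels` and `column_sums` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in for the first rows of `tips`
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "size": [2, 3, 3],
})

# Iterating over a DataFrame directly yields its column labels
labels = [col for col in sample]
print(labels)

# items() yields one (label, Series) pair per column
column_sums = {}
for label, column in sample.items():
    column_sums[label] = column.sum()

print(column_sums)
```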


## Pandas Iteration Functions

Fortunately, pandas provides methods for iterating over DataFrames.

### iterrows()

Pandas has a dedicated method, `iterrows()`, that yields the index and the row (as a Series) for each row of a DataFrame. Here we will run it over the first two records of our DataFrame (`tips.head(2)`).

In [31]:
for index, row in tips.head(2).iterrows():
    print(index)
    print(row)

0
total_bill     16.99
tip             1.01
sex           Female
smoker            No
day              Sun
time          Dinner
size               2
Name: 0, dtype: object
1
total_bill     10.34
tip             1.66
sex             Male
smoker            No
day              Sun
time          Dinner
size               3
Name: 1, dtype: object


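We can pull out particular values by using square brackets to subset `row` by column label. For instance, we can pull out the party size with `row['size']`. (Integer positions such as `row[6]` rely on positional fallback, which is deprecated in recent versions of pandas; use `row.iloc[6]` if you need access by position.)

In [26]:
for index, row in tips.head().iterrows():
    # Access values by column label rather than by integer position
    print(f"{row['size']} people tipped a total of {row['tip']}")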

2 people tipped a total of 1.01
3 people tipped a total of 1.66
3 people tipped a total of 3.5
2 people tipped a total of 3.31
4 people tipped a total of 3.61


### apply()

We can also take advantage of a very useful method associated with DataFrames and Series: `apply()`. This allows us to pass in a function to operate on our data. (This is where those anonymous functions we learned about previously will come in handy...)

In [24]:
tips['length'] = tips['sex'].apply(len)
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,length
0,16.99,1.01,Female,No,Sun,Dinner,2,6
1,10.34,1.66,Male,No,Sun,Dinner,3,4
2,21.01,3.5,Male,No,Sun,Dinner,3,4
3,23.68,3.31,Male,No,Sun,Dinner,2,4
4,24.59,3.61,Female,No,Sun,Dinner,4,6
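
As a sketch of how an anonymous function fits in, the example below passes a `lambda` to `apply()`, first on a single column and then on whole rows with `axis=1`. It assumes a small hand-built DataFrame, `sample`, rather than the full `tips` dataset, and the column names `tip_level` and `tip_pct` and the threshold of 3 are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of `tips`
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
})

# On a Series, the function receives one element at a time
sample["tip_level"] = sample["tip"].apply(lambda t: "high" if t >= 3 else "low")

# On a DataFrame with axis=1, the function receives one row at a time
sample["tip_pct"] = sample.apply(
    lambda row: row["tip"] / row["total_bill"] * 100, axis=1
)

print(sample)
```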


Say we want to create a new column, to identify whether it was a weekday or weekend. Let's first create the function we want to use:

In [43]:
def stage_of_week(day):
    # `day` is a single value from the 'day' column, e.g. 'Sat'
    if day in ['Sat', 'Sun']:
        return 'Weekend'
    else:
        return 'Weekday'

In [45]:
tips['weekday'] = tips['day'].apply(stage_of_week)
tips.tail()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,length,weekday
239,29.03,5.92,Male,No,Sat,Dinner,3,4,Weekend
240,27.18,2.0,Female,Yes,Sat,Dinner,2,6,Weekend
241,22.67,2.0,Male,Yes,Sat,Dinner,2,4,Weekend
242,17.82,1.75,Male,No,Sat,Dinner,2,4,Weekend
243,18.78,3.0,Female,No,Thur,Dinner,2,6,Weekday


## Vectorization

Because DataFrames and Series are both built upon arrays, we can perform what are called *vectorized* operations. This means, instead of operating on each element one at a time (like a scalar), we can operate on the whole object in one go (like a vector).
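
To make the contrast concrete, the sketch below computes the same result twice: once element by element in a Python loop and once as a single vectorized expression. The names `values`, `looped`, and `vectorized` are illustrative.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])

# Element by element: visit each value in a Python loop
looped = []
for v in values:
    looped.append(int(v) * 10)

# Vectorized: one expression applied to the whole array at once
vectorized = values * 10

print(looped)
print(vectorized)
```

Both approaches give the same numbers; the vectorized form is shorter and hands the work to NumPy's compiled routines rather than the Python interpreter.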

### NumPy Array

In [7]:
num_array = np.array([1,2,3,4,5,6,7])

In [11]:
num_array / 2

array([0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5])

In [13]:
num_array ** 2

array([ 1,  4,  9, 16, 25, 36, 49])

### Pandas Series

In the same way, we can operate on the columns of our DataFrame, each of which is a pandas Series.

In [3]:
tips['per_person'] = tips['total_bill'] / tips['size']
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,per_person
0,16.99,1.01,Female,No,Sun,Dinner,2,8.495
1,10.34,1.66,Male,No,Sun,Dinner,3,3.446667
2,21.01,3.5,Male,No,Sun,Dinner,3,7.003333
3,23.68,3.31,Male,No,Sun,Dinner,2,11.84
4,24.59,3.61,Female,No,Sun,Dinner,4,6.1475
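
Vectorized operations are not limited to arithmetic. As a sketch, the weekend/weekday labelling done earlier with `apply()` could also be expressed with `Series.isin()` and `np.where()`. The example assumes a small hand-built DataFrame, `sample`, with plain string values in `day`; the variable name `is_weekend` is illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'day' column of `tips`
sample = pd.DataFrame({"day": ["Sun", "Sat", "Thur", "Fri"]})

# Boolean Series: True where the day is Saturday or Sunday
is_weekend = sample["day"].isin(["Sat", "Sun"])

# Pick a label for every row in one step
sample["weekday"] = np.where(is_weekend, "Weekend", "Weekday")

print(sample)
```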


## Summary

There are two ways to operate on DataFrames: iteration and vectorization. If you can, use vectorization. If this is not possible, you can use `apply()` to run a function on each element or row. As a last resort, if all else fails, `iterrows()` lets you iterate over every row one at a time.
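
As a closing sketch, the three approaches can be put side by side on the same calculation (bill per person). This assumes a small hand-built DataFrame, `sample`, in place of `tips`; the names `vectorized`, `applied`, and `looped` are illustrative.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of `tips`
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "size": [2, 3, 3],
})

# 1. Vectorized: operate on whole columns at once (preferred)
vectorized = sample["total_bill"] / sample["size"]

# 2. apply(): call a function once per row
applied = sample.apply(lambda row: row["total_bill"] / row["size"], axis=1)

# 3. iterrows(): explicit Python loop over rows (last resort)
looped = []
for index, row in sample.iterrows():
    looped.append(row["total_bill"] / row["size"])

print(vectorized.tolist())
print(applied.tolist())
print(looped)
```

All three produce the same values; they differ in how much of the work is done inside pandas rather than in Python-level code.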