# Data Manipulation with pandas

In [1]:
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import datasets
homelessness = pd.read_csv("data/homelessness.csv")
walmart = pd.read_csv("data/walmart.csv")
temperatures = pd.read_csv("data/temperatures.csv")
#avocado = pd.read_csv("DataSets/avocado.csv")

# 1. Transforming DataFrames

How to inspect DataFrames and perform fundamental manipulations, including sorting rows, subsetting, and adding new columns.

In [2]:
# Dataframe for chapter 1
homelessness.head(2)

Unnamed: 0,region,state,individuals,family_members,state_pop
0,East South Central,Alabama,2570.0,864.0,4887681
1,Pacific,Alaska,1434.0,582.0,735139


A DataFrame (`df`) is composed of three parts:
- a NumPy array holding the data (`df.values`)
- an index storing the row labels (`df.index`)
- an index storing the column labels (`df.columns`)
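
A minimal sketch of these three components, using a small hypothetical DataFrame (`toy`, its row labels, and its values are invented for illustration and are not part of the datasets loaded above):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not read from data/homelessness.csv
toy = pd.DataFrame(
    {"state": ["Alabama", "Alaska"], "individuals": [2570.0, 1434.0]},
    index=["row_a", "row_b"],
)

print(type(toy.values))  # the data, as a NumPy array
print(toy.index)         # the row labels
print(toy.columns)       # the column labels
```

Because the columns hold mixed types (text and numbers), the underlying array here has the generic `object` dtype.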

## Inspecting

- `.head()`: returns the first few rows (the “head” of the DataFrame).
- `.tail()`: returns the last few rows (the "tail" of the DataFrame).
- `.sample()`: returns a random sample of rows from the DataFrame.

In [3]:
# display the head of the homelessness data 
print(homelessness.head(5))

               region       state  individuals  family_members  state_pop
0  East South Central     Alabama       2570.0           864.0    4887681
1             Pacific      Alaska       1434.0           582.0     735139
2            Mountain     Arizona       7259.0          2606.0    7158024
3  West South Central    Arkansas       2280.0           432.0    3009733
4             Pacific  California     109008.0         20964.0   39461588


In [4]:
# display the tail of the homelessness data
print(homelessness.tail(3))

                region          state  individuals  family_members  state_pop
48      South Atlantic  West Virginia       1021.0           222.0    1804291
49  East North Central      Wisconsin       2740.0          2167.0    5807406
50            Mountain        Wyoming        434.0           205.0     577601


In [5]:
# display samples of the homelessness data
display(homelessness.sample(3))

Unnamed: 0,region,state,individuals,family_members,state_pop
50,Mountain,Wyoming,434.0,205.0,577601
41,West North Central,South Dakota,836.0,323.0,878698
8,South Atlantic,District of Columbia,3770.0,3134.0,701547


- `.info()`: shows information on each of the columns, such as the data type and number of missing values.
- `.shape`: returns the number of rows and columns of the DataFrame.
- `.describe()`: calculates a few summary statistics for each column.

In [6]:
# display information about homelessness
# (.info() prints its report and returns None, which display() then also shows)
display(homelessness.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   region          51 non-null     object 
 1   state           51 non-null     object 
 2   individuals     51 non-null     float64
 3   family_members  51 non-null     float64
 4   state_pop       51 non-null     int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 2.1+ KB


None

In [7]:

# display the shape of homelessness
display(homelessness.shape)

(51, 5)

In [8]:
# display a description of homelessness
display(homelessness.describe())

Unnamed: 0,individuals,family_members,state_pop
count,51.0,51.0,51.0
mean,7225.784314,3504.882353,6405637.0
std,15991.025083,7805.411811,7327258.0
min,434.0,75.0,577601.0
25%,1446.5,592.0,1777414.0
50%,3082.0,1482.0,4461153.0
75%,6781.5,3196.0,7340946.0
max,109008.0,52070.0,39461590.0


- `.values`: A two-dimensional NumPy array of values.
- `.columns`: An index of columns: the column names.
- `.index`: An index for the rows: either row numbers or row names.

In [9]:
# display the values of homelessness
display(homelessness.values)

array([['East South Central', 'Alabama', 2570.0, 864.0, 4887681],
       ['Pacific', 'Alaska', 1434.0, 582.0, 735139],
       ['Mountain', 'Arizona', 7259.0, 2606.0, 7158024],
       ['West South Central', 'Arkansas', 2280.0, 432.0, 3009733],
       ['Pacific', 'California', 109008.0, 20964.0, 39461588],
       ['Mountain', 'Colorado', 7607.0, 3250.0, 5691287],
       ['New England', 'Connecticut', 2280.0, 1696.0, 3571520],
       ['South Atlantic', 'Delaware', 708.0, 374.0, 965479],
       ['South Atlantic', 'District of Columbia', 3770.0, 3134.0, 701547],
       ['South Atlantic', 'Florida', 21443.0, 9587.0, 21244317],
       ['South Atlantic', 'Georgia', 6943.0, 2556.0, 10511131],
       ['Pacific', 'Hawaii', 4131.0, 2399.0, 1420593],
       ['Mountain', 'Idaho', 1297.0, 715.0, 1750536],
       ['East North Central', 'Illinois', 6752.0, 3891.0, 12723071],
       ['East North Central', 'Indiana', 3776.0, 1482.0, 6695497],
       ['West North Central', 'Iowa', 1711.0, 1038.0, 3148618]

In [10]:
# display the column index of homelessness
display(homelessness.columns)

Index(['region', 'state', 'individuals', 'family_members', 'state_pop'], dtype='object')

In [11]:
# display the row index of homelessness
display(homelessness.index)

RangeIndex(start=0, stop=51, step=1)

## Sorting values

`.sort_values()` is a pandas method that sorts a DataFrame by one or more columns, in ascending order (the default) or descending order. By combining `.sort_values()` with `.head()`, you can answer questions of the form, "What are the top cases where…?".

In [12]:
# Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values("family_members", ascending = False )

# display the top few rows
display(homelessness_fam.head())

Unnamed: 0,region,state,individuals,family_members,state_pop
32,Mid-Atlantic,New York,39827.0,52070.0,19530351
4,Pacific,California,109008.0,20964.0,39461588
21,New England,Massachusetts,6811.0,13257.0,6882635
9,South Atlantic,Florida,21443.0,9587.0,21244317
43,West South Central,Texas,19199.0,6111.0,28628666


In [13]:
# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(['region', 'family_members'], ascending=[True,False])

# display the top few rows
display(homelessness_reg_fam.head())

Unnamed: 0,region,state,individuals,family_members,state_pop
13,East North Central,Illinois,6752.0,3891.0,12723071
35,East North Central,Ohio,6929.0,3320.0,11676341
22,East North Central,Michigan,5209.0,3142.0,9984072
49,East North Central,Wisconsin,2740.0,2167.0,5807406
14,East North Central,Indiana,3776.0,1482.0,6695497


## Subsetting

### Subsetting columns

Square brackets (`[]`) can be used to select only the columns that matter. To select only "col_a" of the DataFrame df, use

`df["col_a"]`

To select "col_a" and "col_b" of df, use

`df[["col_a", "col_b"]]`
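
One detail worth a quick sketch is the type of the result: single brackets return a Series, while double brackets return a DataFrame, even for one column. The DataFrame `df` below is a hypothetical stand-in with made-up values:

```python
import pandas as pd

# Hypothetical toy DataFrame for illustration
df = pd.DataFrame({"col_a": [1, 2], "col_b": [3, 4], "col_c": [5, 6]})

one_col = df["col_a"]               # single brackets: a Series
one_col_df = df[["col_a"]]          # double brackets: a one-column DataFrame
two_cols = df[["col_a", "col_b"]]   # double brackets: a two-column DataFrame

print(type(one_col).__name__)
print(type(one_col_df).__name__)
print(two_cols.shape)
```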

For example, consider the following cases, where the dataset `homelessness` has the following columns:

In [14]:
homelessness.head(0)

Unnamed: 0,region,state,individuals,family_members,state_pop


In [15]:
# Select the state and family_members columns
state_fam = homelessness[["state","family_members"]]

# display the head of the result
display(state_fam.head(2))

Unnamed: 0,state,family_members
0,Alabama,864.0
1,Alaska,582.0


In [16]:
# Select only the individuals and state columns, in that order
ind_state = homelessness[["individuals","state"]]

# display the head of the result
display(ind_state.head(2))

Unnamed: 0,individuals,state
0,2570.0,Alabama
1,1434.0,Alaska


### Subsetting rows - Boolean indexing

With Boolean indexing we can use a boolean condition to select rows where the condition is True. Consider the following examples:

In [17]:
# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness["individuals"] > 10000]

# See the result
display(ind_gt_10k)

Unnamed: 0,region,state,individuals,family_members,state_pop
4,Pacific,California,109008.0,20964.0,39461588
9,South Atlantic,Florida,21443.0,9587.0,21244317
32,Mid-Atlantic,New York,39827.0,52070.0,19530351
37,Pacific,Oregon,11139.0,3337.0,4181886
43,West South Central,Texas,19199.0,6111.0,28628666
47,Pacific,Washington,16424.0,5880.0,7523869


In [18]:
# Filter for rows where region is Mountain
mountain_reg = homelessness[ homelessness["region" ] == "Mountain" ]

# See the result
display(mountain_reg)

Unnamed: 0,region,state,individuals,family_members,state_pop
2,Mountain,Arizona,7259.0,2606.0,7158024
5,Mountain,Colorado,7607.0,3250.0,5691287
12,Mountain,Idaho,1297.0,715.0,1750536
26,Mountain,Montana,983.0,422.0,1060665
28,Mountain,Nevada,7058.0,486.0,3027341
31,Mountain,New Mexico,1949.0,602.0,2092741
44,Mountain,Utah,1904.0,972.0,3153550
50,Mountain,Wyoming,434.0,205.0,577601


In [19]:
# Filter for rows where family_members is less than 1000 
# and region is Pacific
fam_lt_1k_pac = homelessness[ (homelessness["family_members"]< 1000)&(homelessness["region"]== "Pacific")]

# See the result
display(fam_lt_1k_pac)

Unnamed: 0,region,state,individuals,family_members,state_pop
1,Pacific,Alaska,1434.0,582.0,735139
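
The condition inside the brackets is itself a Boolean Series (a mask) with one entry per row. The following sketch, on hypothetical toy data, builds the two masks separately before combining them. The parentheses in the cell above are needed because `&` binds more tightly than `<` and `==` in Python.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame(
    {
        "region": ["Pacific", "Pacific", "Mountain"],
        "family_members": [582.0, 20964.0, 205.0],
    }
)

# Each comparison yields a Boolean Series aligned with the rows
small_family = toy["family_members"] < 1000
is_pacific = toy["region"] == "Pacific"

# & combines the masks element-wise; only rows where both are True are kept
mask = small_family & is_pacific
print(mask.tolist())
print(toy[mask])
```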


**Subsetting rows by categorical variables**

Subsetting data based on a categorical variable often involves using the "or" operator (`|`) to select rows from multiple categories.

In [20]:
# Subset for rows in South Atlantic or Mid-Atlantic regions
south_mid_atlantic = homelessness[ (homelessness["region"] == "South Atlantic" )| (homelessness["region"] == "Mid-Atlantic")]

# See the result
display(south_mid_atlantic.head(6))

Unnamed: 0,region,state,individuals,family_members,state_pop
7,South Atlantic,Delaware,708.0,374.0,965479
8,South Atlantic,District of Columbia,3770.0,3134.0,701547
9,South Atlantic,Florida,21443.0,9587.0,21244317
10,South Atlantic,Georgia,6943.0,2556.0,10511131
20,South Atlantic,Maryland,4914.0,2230.0,6035802
30,Mid-Atlantic,New Jersey,6048.0,3350.0,8886025


**`.isin()` method**

This can get tedious when you want all states in one of three different regions, for example. Instead, use the `.isin()` method, which will allow you to tackle this problem by writing one condition instead of three separate ones.

In [21]:
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness["state"].isin(canu)]

# See the result
display(mojave_homelessness)

Unnamed: 0,region,state,individuals,family_members,state_pop
2,Mountain,Arizona,7259.0,2606.0,7158024
4,Pacific,California,109008.0,20964.0,39461588
28,Mountain,Nevada,7058.0,486.0,3027341
44,Mountain,Utah,1904.0,972.0,3153550
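
As a rough check that `.isin()` is equivalent to chaining `==` comparisons with `|`, the sketch below compares both forms on a hypothetical toy DataFrame. It also shows `~`, which negates a mask to keep the rows that are not in the list.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({"state": ["Arizona", "Texas", "Nevada", "Ohio", "Utah"]})
wanted = ["Arizona", "Nevada", "Utah"]

# Three separate conditions joined with |
by_or = toy[
    (toy["state"] == "Arizona")
    | (toy["state"] == "Nevada")
    | (toy["state"] == "Utah")
]

# One condition with .isin()
by_isin = toy[toy["state"].isin(wanted)]

# ~ inverts the mask: rows whose state is NOT in the list
not_in = toy[~toy["state"].isin(wanted)]

print(by_or.equals(by_isin))
print(not_in)
```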


## Adding columns

In [22]:
# Add total col as sum of individuals and family_members
homelessness["total"] = homelessness["individuals"] + homelessness["family_members"] 

# Add p_individuals col as proportion of individuals
homelessness["p_individuals"] = homelessness["individuals"]/homelessness["total"]

# See the result
display(homelessness.head(3))

Unnamed: 0,region,state,individuals,family_members,state_pop,total,p_individuals
0,East South Central,Alabama,2570.0,864.0,4887681,3434.0,0.748398
1,Pacific,Alaska,1434.0,582.0,735139,2016.0,0.71131
2,Mountain,Arizona,7259.0,2606.0,7158024,9865.0,0.735834


We can mix and match the four manipulations (**sorting rows, subsetting columns, subsetting rows, and adding new columns**) to answer some questions like, "Which state has the highest number of homeless individuals per 10,000 people in the state?" 

In [23]:
# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"] 

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness["indiv_per_10k"]> 20] 

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values(["indiv_per_10k"] , ascending = False )

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[["state", "indiv_per_10k"]]

# See the result
display(result)

Unnamed: 0,state,indiv_per_10k
8,District of Columbia,53.738381
11,Hawaii,29.079406
4,California,27.623825
37,Oregon,26.636307
28,Nevada,23.314189
47,Washington,21.829195
32,New York,20.392363
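
The same four steps can also be written as a single method chain. This is only a sketch of an alternative style, on hypothetical toy data: `.assign()` adds the column without modifying the original DataFrame, and `.query()` filters rows using a string expression.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame(
    {
        "state": ["A", "B", "C"],
        "individuals": [500.0, 3000.0, 1200.0],
        "state_pop": [1_000_000, 1_000_000, 300_000],
    }
)

result = (
    toy
    # add the new column
    .assign(indiv_per_10k=lambda d: 10000 * d["individuals"] / d["state_pop"])
    # subset rows
    .query("indiv_per_10k > 20")
    # sort rows
    .sort_values("indiv_per_10k", ascending=False)
    # subset columns
    [["state", "indiv_per_10k"]]
)
print(result)
```

The intermediate variables used in the cell above are easier to inspect step by step; the chain trades that for brevity.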


# 2. Aggregating DataFrames

How to calculate summary statistics on DataFrame columns, grouped summary statistics, and pivot tables.

In [24]:
# Dataframe for chapter 2
walmart.head(3)

Unnamed: 0,store,type,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment
0,1,A,1,2010-02-05,24924.5,False,5.727778,0.679451,8.106
1,1,A,1,2010-03-05,21827.9,False,8.055556,0.693452,8.106
2,1,A,1,2010-04-02,57258.43,False,16.816667,0.718284,7.808


## Summary statistics

Summary statistics condense many numbers into a single value. For example, the mean, median, minimum, maximum, and standard deviation are summary statistics that we can get from a DataFrame.

In the following example we can see that the mean weekly sales amount is almost double the median weekly sales amount. This can tell you that there are a few very high sales weeks that are making the mean so much higher than the median.

In [25]:
# display the info about the sales DataFrame
display('info', walmart.info())

# display the mean of weekly_sales
display('mean of weekly_sales',walmart['weekly_sales'].mean())

# display the median of weekly_sales
display('median of weekly_sales', walmart['weekly_sales'].median())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10774 entries, 0 to 10773
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   store                 10774 non-null  int64  
 1   type                  10774 non-null  object 
 2   department            10774 non-null  int64  
 3   date                  10774 non-null  object 
 4   weekly_sales          10774 non-null  float64
 5   is_holiday            10774 non-null  bool   
 6   temperature_c         10774 non-null  float64
 7   fuel_price_usd_per_l  10774 non-null  float64
 8   unemployment          10774 non-null  float64
dtypes: bool(1), float64(4), int64(2), object(2)
memory usage: 684.0+ KB


'info'

None

'mean of weekly_sales'

23843.95014850566

'median of weekly_sales'

12049.064999999999
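
The effect of a few very large values on the mean can be reproduced on a small hypothetical series (the numbers below are invented for illustration): one extreme value pulls the mean far above the median, which barely moves.

```python
import pandas as pd

# Four typical weeks and one exceptionally high week (hypothetical values)
sales = pd.Series([10.0, 12.0, 11.0, 13.0, 200.0])

print(sales.mean())    # pulled upward by the single large value
print(sales.median())  # the middle value of the sorted data
```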

### Efficient summaries with `.agg()`

The `.agg()` method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, which keeps aggregation code concise. For example,

`df['column'].agg(f)`

will apply the function $f$ to the column named 'column' of the dataframe df.


__Remarks:__ **The interquartile range (IQR)**

IQR is a measure of variability, based on dividing a dataset into quartiles. It is defined as the difference between the third quartile ($Q_3$) and the first quartile ($Q_1$).

The IQR gives a sense of how spread out the values in a dataset are. Because it ignores the lowest and highest quarters of the data, it is a robust measure of dispersion that is far less sensitive to outliers than the range or the standard deviation. It is often used in exploratory data analysis to detect outliers and to summarize a large dataset.

- To calculate the IQR
  1. First, sort the data in ascending order.
  2. $Q_1$ is the median of the lower half of the data, and $Q_3$ is the median of the upper half.
  3. Finally, the IQR is calculated as $Q_3 - Q_1$.

- For example, if a dataset has the following values: $[3, 5, 7, 8, 9, 11, 14, 15, 16, 17]$,
  - $Q_1$ would be 7 (median of the lower half, $[3, 5, 7, 8, 9]$),
  - $Q_3$ would be 15 (median of the upper half, $[11, 14, 15, 16, 17]$),
  - IQR would be $15 - 7 = 8$.
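
A quick check of this example in pandas, assuming the two halves are the first five and the last five sorted values. The sketch also computes the quartiles with `.quantile()`, which by default interpolates linearly between neighbouring data points and can therefore give a slightly different IQR.

```python
import pandas as pd

values = pd.Series([3, 5, 7, 8, 9, 11, 14, 15, 16, 17])

# Median-of-halves method
q1_halves = values.iloc[:5].median()   # median of [3, 5, 7, 8, 9]
q3_halves = values.iloc[5:].median()   # median of [11, 14, 15, 16, 17]
iqr_halves = q3_halves - q1_halves
print(q1_halves, q3_halves, iqr_halves)   # 7.0 15.0 8.0

# pandas default: linear interpolation between data points
q1_interp = values.quantile(0.25)
q3_interp = values.quantile(0.75)
iqr_interp = q3_interp - q1_interp
print(q1_interp, q3_interp, iqr_interp)   # 7.25 14.75 7.5
```

Both conventions are in common use; what matters is applying the same one consistently when comparing columns.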

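For this case, we create a custom function to calculate the IQR (interquartile range) and apply it to several columns. Note that `.quantile()` interpolates linearly between data points by default, so its quartiles can differ slightly from the median-of-halves method described above.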

In [26]:
# A custom IQR function
def iqr(column):
    q_3 = column.quantile(0.75)
    q_1 = column.quantile(0.25)
    return q_3 - q_1

In [27]:
# display IQR of the temperature_c column
display('temperature_c IQR',walmart['temperature_c'].agg(iqr))

'temperature_c IQR'

16.583333333333336

In [28]:
# Update to display IQR of temperature_c, fuel_price_usd_per_l, & unemployment
display('IQR of three columns',walmart[["temperature_c", 'fuel_price_usd_per_l', 'unemployment']].agg(iqr))

'IQR of three columns'

temperature_c           16.583333
fuel_price_usd_per_l     0.073176
unemployment             0.565000
dtype: float64

In [29]:
# Update to display IQR and median of temperature_c, fuel_price_usd_per_l, & unemployment
display('IQR and median ',walmart[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, np.median]))

'IQR and median '

Unnamed: 0,temperature_c,fuel_price_usd_per_l,unemployment
iqr,16.583333,0.073176,0.565
median,16.966667,0.743381,8.099


## Groupby

The basic idea behind `.groupby()` is to split a large DataFrame into smaller groups based on the values of one or more columns. An aggregation (such as `.sum()`, `.mean()`, or `.agg()`) is then applied to each group, producing a new object that summarizes the information in each group.
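
The split and the aggregation can be made visible on a small hypothetical DataFrame (the values below are invented for illustration): iterating over the grouped object shows the sub-tables, and an aggregation then returns one value per group.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame(
    {
        "type": ["A", "A", "B", "B", "B"],
        "weekly_sales": [100.0, 300.0, 50.0, 70.0, 90.0],
    }
)

# Split: one group per distinct value of "type"
for store_type, group in toy.groupby("type"):
    print(store_type, group["weekly_sales"].tolist())

# Apply and combine: one summary value per group, indexed by "type"
mean_by_type = toy.groupby("type")["weekly_sales"].mean()
print(mean_by_type)
```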

Walmart distinguishes three types of stores: "supercenters," "discount stores," and "neighborhood markets," encoded in this dataset as type "A," "B," and "C."  We can calculate the total weekly sales for each type of stores and take the proportions:

In [30]:
# Calc total weekly sales
sales_all = walmart["weekly_sales"].sum()

# Subset for type A stores, calc total weekly sales
sales_A = walmart[walmart["type"] == "A"]["weekly_sales"].sum()

# Subset for type B stores, calc total weekly sales
sales_B = walmart[walmart["type"] == "B"]["weekly_sales"].sum()

# Subset for type C stores, calc total weekly sales
sales_C = walmart[walmart["type"] == "C"]["weekly_sales"].sum()

# Get proportion for each type
sales_propn_by_type = [sales_A, sales_B, sales_C] / sales_all
display('proportion',sales_propn_by_type)

'proportion'

array([0.9097747, 0.0902253, 0.       ])

or, in a more sophisticated way, using `.groupby()`. Type C has no rows in this dataset, so it does not appear in the grouped result:

In [31]:
# Group by type; calc total weekly sales
sales_by_type = walmart.groupby("type")["weekly_sales"].sum()

# Get proportion for each type
sales_propn_by_type = sales_by_type / sum(sales_by_type)
display(sales_propn_by_type)

type
A    0.909775
B    0.090225
Name: weekly_sales, dtype: float64

We saw how the `.agg()` method is useful for computing multiple statistics on multiple variables. It also works with `.groupby()` to summarize each subset of the DataFrame of interest.

In [32]:
# For each store type, aggregate weekly_sales: get min, max, mean, and median
sales_stats = walmart.groupby("type")['weekly_sales'].agg([np.min,np.max,np.mean,np.median])

# display sales_stats
display(sales_stats)

# For each store type, aggregate unemployment and fuel_price_usd_per_l: get min, max, mean, and median
unemp_fuel_stats = walmart.groupby("type")[["unemployment", "fuel_price_usd_per_l"]]\
                          .agg([np.min, np.max, np.mean, np.median])
# display unemp_fuel_stats
display(unemp_fuel_stats)

Unnamed: 0_level_0,amin,amax,mean,median
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,-1098.0,293966.05,23674.667242,11943.92
B,-798.0,232558.51,25696.67837,13336.08


Unnamed: 0_level_0,unemployment,unemployment,unemployment,unemployment,fuel_price_usd_per_l,fuel_price_usd_per_l,fuel_price_usd_per_l,fuel_price_usd_per_l
Unnamed: 0_level_1,amin,amax,mean,median,amin,amax,mean,median
type,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
A,3.879,8.992,7.972611,8.067,0.664129,1.10741,0.744619,0.735455
B,7.17,9.765,9.279323,9.199,0.760023,1.107674,0.805858,0.803348
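
Aggregations can also be named by string instead of passing NumPy functions. The sketch below uses hypothetical toy data; it is offered as an alternative spelling, on the assumption that, depending on the pandas version, NumPy functions such as `np.min` may be labelled `amin` (as in the output above) or may trigger a deprecation warning, whereas string names produce plain column labels.

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame(
    {
        "type": ["A", "A", "B", "B"],
        "weekly_sales": [10.0, 30.0, 5.0, 7.0],
    }
)

# String names select pandas' built-in aggregations
stats = toy.groupby("type")["weekly_sales"].agg(["min", "max", "mean", "median"])
print(stats)
```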


## Pivot tables

Pivot tables allow you to take a multi-dimensional data set and transform it into a two-dimensional data set that is easier to understand and analyze. They can help you identify patterns and relationships in your data, and can be used to create interactive dashboards and reports.

A pivot table in pandas is created using the `.pivot_table()` method. You specify the columns of the DataFrame to use as the row and column indices, and the values to be aggregated. The values can be aggregated using a variety of functions, such as sum, mean, count, etc. **By default, `.pivot_table()` uses the mean as the aggregation function**, but you can specify a different aggregation function using the `aggfunc` parameter.

Pivot tables are the standard way of aggregating data in spreadsheets. In pandas, pivot tables are essentially just another way of performing grouped calculations. That is, the `.pivot_table()` method is just an alternative to `.groupby()`. 

In [33]:
# mean weekly_sales for each store type 
mean_sales_by_type = walmart.groupby("type")["weekly_sales"].mean()

display(mean_sales_by_type)

type
A    23674.667242
B    25696.678370
Name: weekly_sales, dtype: float64

In [34]:
# Pivot for mean weekly_sales for each store type
mean_sales_by_type = walmart.pivot_table(values = "weekly_sales",index = "type")

# display mean_sales_by_type
display(mean_sales_by_type)

Unnamed: 0_level_0,weekly_sales
type,Unnamed: 1_level_1
A,23674.667242
B,25696.67837


We can also return a pivot table with multiple statistics:

In [35]:
# Pivot for mean and median weekly_sales for each store type
mean_med_sales_by_type = walmart.pivot_table(values ="weekly_sales", index ="type", aggfunc =[np.mean,np.median])


display(mean_med_sales_by_type)

Unnamed: 0_level_0,mean,median
Unnamed: 0_level_1,weekly_sales,weekly_sales
type,Unnamed: 1_level_2,Unnamed: 2_level_2
A,23674.667242,11943.92
B,25696.67837,13336.08


or create a pivot table with two variables. For this we need to specify the `index`, which groups elements into rows, and `columns`, the variable whose values become the columns of the pivot table.

In [36]:
# Pivot for mean weekly_sales by store type and holiday 
mean_sales_by_type_holiday = walmart.pivot_table(values="weekly_sales", index="type", columns="is_holiday")


display(mean_sales_by_type_holiday)

is_holiday,False,True
type,Unnamed: 1_level_1,Unnamed: 2_level_1
A,23768.583523,590.04525
B,25751.980533,810.705
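
A two-variable pivot table corresponds to grouping by both columns and then moving the second grouping level into the columns with `.unstack()`. The sketch below illustrates this on hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame(
    {
        "type": ["A", "A", "A", "B", "B", "B"],
        "is_holiday": [False, False, True, False, True, True],
        "weekly_sales": [10.0, 20.0, 30.0, 40.0, 50.0, 70.0],
    }
)

# Pivot table: rows are store types, columns are is_holiday values
pivoted = toy.pivot_table(values="weekly_sales", index="type", columns="is_holiday")

# Same numbers via groupby: group on both columns, then unstack the inner level
grouped = toy.groupby(["type", "is_holiday"])["weekly_sales"].mean().unstack()

print(pivoted)
print(grouped)
```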


In [37]:
# display mean weekly_sales by type (rows) and department (columns); fill missing values with 0
display(walmart.pivot_table(values = 'weekly_sales',index = "type", columns = 'department', fill_value = 0))

department,1,2,3,4,5,6,7,8,9,10,...,90,91,92,93,94,95,96,97,98,99
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A,30961.725379,67600.158788,17160.002955,44285.399091,34821.011364,7136.292652,38454.336818,48583.475303,30120.449924,30930.456364,...,85776.905909,70423.165227,139722.204773,53413.633939,60081.155303,123933.787121,21367.042857,28471.26697,12875.423182,379.123659
B,44050.626667,112958.526667,30580.655,51219.654167,63236.875,10717.2975,52909.653333,90733.753333,66679.301667,48595.126667,...,14780.21,13199.6025,50859.278333,1466.274167,161.445833,77082.1025,9528.538333,5828.873333,217.428333,0.0


In [38]:
# display the mean weekly_sales by department and type; fill missing values with 0s;
# margins=True adds an "All" row and column holding the mean over all rows and cols
display(walmart.pivot_table(values="weekly_sales", index="department", columns="type", fill_value = 0, margins = True))

type,A,B,All
department,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,30961.725379,44050.626667,32052.467153
2,67600.158788,112958.526667,71380.022778
3,17160.002955,30580.655000,18278.390625
4,44285.399091,51219.654167,44863.253681
5,34821.011364,63236.875000,37189.000000
...,...,...,...
96,21367.042857,9528.538333,20337.607681
97,28471.266970,5828.873333,26584.400833
98,12875.423182,217.428333,11820.590278
99,379.123659,0.000000,379.123659
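
With the default aggregation, the "All" margins contain means computed from the underlying rows, not sums and not means of the cell means. A small sketch on hypothetical toy data makes the difference visible:

```python
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame(
    {
        "department": [1, 1, 1, 2],
        "type": ["A", "A", "B", "A"],
        "weekly_sales": [10.0, 20.0, 60.0, 100.0],
    }
)

table = toy.pivot_table(
    values="weekly_sales",
    index="department",
    columns="type",
    fill_value=0,   # department 2 has no type B rows
    margins=True,   # adds the "All" row and column
)
print(table)
```

For department 1 the cells are 15 (type A) and 60 (type B), but the "All" entry is 30, the mean of the three underlying rows, rather than 37.5, the mean of the two cells.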


# 3. Slicing and Indexing DataFrames

Indexes are supercharged row and column names. They can be combined with slicing for DataFrame subsetting.

In [39]:
# Dataframe for chapter 3
temperatures.head(3)

Unnamed: 0,date,city,country,avg_temp_c
0,2000-01-01,Abidjan,Côte D'Ivoire,27.293
1,2000-02-01,Abidjan,Côte D'Ivoire,27.685
2,2000-03-01,Abidjan,Côte D'Ivoire,29.061


## Indexes 

An index can be thought of as a set of row labels for a DataFrame, loosely similar to a primary key in a relational database, although pandas does not require index values to be unique.
By default, a DataFrame in pandas is indexed by a sequence of integers starting from 0, but you can also specify a custom index using `.set_index()`.

Setting a column as the index can be useful in a number of scenarios, such as when you have a unique identifier for each row in the DataFrame. For example, you might have a column containing unique customer IDs and you might want to set that column as the index of your DataFrame. This allows you to access rows in the DataFrame based on the customer ID, rather than the integer index.

The index in pandas is also important because it is used to align data during operations such as merging and concatenating DataFrames. Additionally, the index is used when performing operations like grouping, aggregating, and slicing data.
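
Alignment can be seen in a small sketch with two hypothetical Series: arithmetic matches values by index label rather than by position, and labels present in only one operand produce missing values.

```python
import pandas as pd

# Two hypothetical Series with partially overlapping labels
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "w"])

# Values are paired by label: "y" with "y", "z" with "z"
total = a + b
print(total)
```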
  

A DataFrame has two axes, rows and columns, and each has its own index object: the `.index` attribute holds the row labels, while the `.columns` attribute holds the column labels.

In [40]:
# Columns index 
display('Columns index ',temperatures.columns)

# Row index 
display('row indexes',temperatures.index)

'Columns index '

Index(['date', 'city', 'country', 'avg_temp_c'], dtype='object')

'row indexes'

RangeIndex(start=0, stop=16500, step=1)

**Set index**

The `.set_index()` method lets you set one or more columns as the index of a DataFrame. Here's an example of how to use `.set_index()` to set a single column as the index of a DataFrame:

In [41]:
# Set city as the index
temperatures_ind = temperatures.set_index('city')
temperatures_ind.head(2)

Unnamed: 0_level_0,date,country,avg_temp_c
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Abidjan,2000-01-01,Côte D'Ivoire,27.293
Abidjan,2000-02-01,Côte D'Ivoire,27.685


Setting a column as the index moves it out of the regular columns, so the DataFrame has one fewer column. The `.columns` attribute of a DataFrame returns only the labels of the regular columns, not the index.

The reason is that the index in a Pandas DataFrame serves a different purpose than the regular columns. The index is used to identify each row in the DataFrame and to support fast lookups and alignment of data between different DataFrames. Setting a column as the index of a DataFrame allows you to access the rows based on the values in that column, rather than the default integer index.

In [42]:
# Row index 
display("New row indexes", temperatures_ind.index)

# Column index
display('new columns index', temperatures_ind.columns)

'New row indexes'

Index(['Abidjan', 'Abidjan', 'Abidjan', 'Abidjan', 'Abidjan', 'Abidjan',
       'Abidjan', 'Abidjan', 'Abidjan', 'Abidjan',
       ...
       'Xian', 'Xian', 'Xian', 'Xian', 'Xian', 'Xian', 'Xian', 'Xian', 'Xian',
       'Xian'],
      dtype='object', name='city', length=16500)

'new columns index'

Index(['date', 'country', 'avg_temp_c'], dtype='object')

**Reset index**

To undo `.set_index()`, we can reset the index with the `.reset_index()` method, which turns the index back into a regular column, or we can discard the index contents entirely with `.reset_index(drop = True)`.

In [43]:
# Reset the index, keeping its contents
display("reset index",temperatures_ind.reset_index().head(2))

# Reset the index, dropping its contents
display("drop index",temperatures_ind.reset_index(drop = True).head(2))

'reset index'

Unnamed: 0,city,date,country,avg_temp_c
0,Abidjan,2000-01-01,Côte D'Ivoire,27.293
1,Abidjan,2000-02-01,Côte D'Ivoire,27.685


'drop index'

Unnamed: 0,date,country,avg_temp_c
0,2000-01-01,Côte D'Ivoire,27.293
1,2000-02-01,Côte D'Ivoire,27.685


**Sort index values**

The `.sort_index` method sorts the DataFrame along the index axis and returns a new DataFrame sorted by the index values. By default, the `.sort_index` method sorts the DataFrame in ascending order.

In [44]:
# Sort temperatures_ind by index values
display(temperatures_ind.sort_index().head(2))

Unnamed: 0_level_0,date,country,avg_temp_c
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Abidjan,2000-01-01,Côte D'Ivoire,27.293
Abidjan,2008-11-01,Côte D'Ivoire,27.302


In [45]:
# Sort temperatures_ind by index values at the city level
display(temperatures_ind.sort_index(level=["city"]).head(2))

Unnamed: 0_level_0,date,country,avg_temp_c
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Abidjan,2000-01-01,Côte D'Ivoire,27.293
Abidjan,2008-11-01,Côte D'Ivoire,27.302


In [46]:
# Set country and city as a multi-level index
temperatures_ind = temperatures.set_index(["country", "city"])

# Sort temperatures_ind by country then descending city
display(temperatures_ind.sort_index(level=["country", "city"], ascending = [True, False]).head(2))

Unnamed: 0_level_0,Unnamed: 1_level_0,date,avg_temp_c
country,city,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,Kabul,2000-01-01,3.326
Afghanistan,Kabul,2000-02-01,3.454


## Subsetting rows

There are several ways to subset rows in pandas based on specific criteria.
1. Boolean indexing: filters the rows of a DataFrame based on a condition (as previously discussed)
2. `.loc`: subsets rows based on **label values**
3. `.iloc`: subsets rows based on **integer position** (explained in the next section)
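
A minimal sketch contrasting the last two options on a hypothetical toy DataFrame (the values are invented for illustration): `.loc` takes labels, while `.iloc` takes zero-based integer positions.

```python
import pandas as pd

# Hypothetical toy data, indexed by city name
toy = pd.DataFrame(
    {"avg_temp_c": [27.3, -7.3, 3.3]},
    index=["Abidjan", "Moscow", "Kabul"],
)

by_label = toy.loc["Moscow", "avg_temp_c"]  # row label, column label
by_position = toy.iloc[1, 0]                # second row, first column
print(by_label, by_position)
```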

- **Subset rows with brackets and boolean index**

In [47]:
# Subset temperatures using square brackets and boolean index
display(temperatures[ (temperatures["city"] == "Moscow") | (temperatures["city"] == "Saint Petersburg")].head(2))

Unnamed: 0,date,city,country,avg_temp_c
10725,2000-01-01,Moscow,Russia,-7.313
10726,2000-02-01,Moscow,Russia,-3.551


- **Subset rows with brackets and `.isin` method**

In [48]:
# Make a list of cities to subset on
cities = ["Moscow", "Saint Petersburg"]
# Subset temperatures using square brackets
display(temperatures[temperatures['city'].isin(cities)].head(2))

Unnamed: 0,date,city,country,avg_temp_c
10725,2000-01-01,Moscow,Russia,-7.313
10726,2000-02-01,Moscow,Russia,-3.551


- **Subset rows with `.loc` method (clean way)**

The `.loc` indexer is used to access and manipulate data in a DataFrame by label. **When a column is set as the index of a DataFrame, we can use `.loc`** to access and manipulate the data based on that column's values, which are now the index labels.

In [49]:
# Set city as the index so that .loc can look rows up by city name
temperatures_ind = temperatures.set_index('city')

# Subset temperatures_ind using .loc[]
display(temperatures_ind.loc[cities].head(2))

Unnamed: 0_level_0,date,country,avg_temp_c
city,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Moscow,2000-01-01,Russia,-7.313
Moscow,2000-02-01,Russia,-3.551


**Setting multi-level indexes and subsetting**

Indexes can also be made out of multiple columns, forming a multi-level index (sometimes called a hierarchical index). There is a trade-off to using these.

The benefit is that multi-level indexes make it more natural to reason about nested categorical variables. For example, in a clinical trial, you might have control and treatment groups. Then each test subject belongs to one or another group, and we can say that a test subject is nested inside the treatment group. Similarly, in the temperature dataset, the city is located in the country, so we can say a city is nested inside the country.

The order of the columns in the MultiIndex is determined by the order in which they are specified in the `.set_index` method, as follows

In [50]:
# Index temperatures by country & city
temperatures_ind =  temperatures.set_index(["country","city"])
display(temperatures_ind.head(2))

Unnamed: 0_level_0,Unnamed: 1_level_0,date,avg_temp_c
country,city,Unnamed: 2_level_1,Unnamed: 3_level_1
Côte D'Ivoire,Abidjan,2000-01-01,27.293
Côte D'Ivoire,Abidjan,2000-02-01,27.685


With a multi-level index, subsetting several rows requires a list of tuples, where each tuple holds one label per index level (here, country and city).

In [51]:
# List of tuples: Brazil, Rio De Janeiro & Pakistan, Lahore
rows_to_keep = [("Brazil", "Rio De Janeiro"),("Pakistan","Lahore")]

# Subset for rows to keep
display(temperatures_ind.loc[rows_to_keep].head(2))

Unnamed: 0_level_0,Unnamed: 1_level_0,date,avg_temp_c
country,city,Unnamed: 2_level_1,Unnamed: 3_level_1
Brazil,Rio De Janeiro,2000-01-01,25.974
Brazil,Rio De Janeiro,2000-02-01,26.699
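
To see how a single outer label differs from a list of tuples, here is a small sketch on a toy two-level index (country and city names with made-up temperatures, not the real dataset):

```python
import pandas as pd

# Toy data with hypothetical values
toy = pd.DataFrame(
    {
        "country": ["Brazil", "Brazil", "Peru", "Peru"],
        "city": ["Recife", "Salvador", "Cusco", "Lima"],
        "avg_temp_c": [26.0, 25.5, 12.0, 19.0],
    }
).set_index(["country", "city"]).sort_index()

# A single outer-level label returns every row for that country
# (the country level is dropped from the result)
brazil = toy.loc["Brazil"]

# A list of (country, city) tuples returns only those exact pairs
pairs = toy.loc[[("Brazil", "Recife"), ("Peru", "Lima")]]

print(brazil)
print(pairs)
```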


## Slicing with `.loc` and `.iloc` (slicing rows and columns)

In pandas, slicing and subsetting both extract a portion of a DataFrame; the terms differ mainly in emphasis.

- **Subsetting:** selecting a portion of a DataFrame based on index values or conditions.
- **Slicing:** a term commonly used in Python for selecting a run of consecutive elements from a sequence (e.g. list, tuple, string, or NumPy array) based on index values.

### Slicing with loc

Slicing lets you select consecutive elements of an object using `first:last` syntax. DataFrames can be sliced by index values or by row/column number; we'll start with the first case. This involves slicing inside the `.loc[]` method.

- Compared to slicing lists, there are a few things to remember.

    - You can only slice an index if the index is sorted (using `.sort_index()`).
    - To slice at the **outer level**, first and last **can be strings**.
    - To slice at **inner levels**, first and last **should be tuples**.
    - If you pass a single slice to `.loc[]`, it will slice the rows.
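
The rules above can be sketched on a toy DataFrame (hypothetical values, not taken from `temperatures.csv`). One further difference from list slicing is that a label slice in `.loc[]` includes the `last` label:

```python
import pandas as pd

# Toy data with hypothetical values
toy = pd.DataFrame(
    {
        "country": ["Peru", "Brazil", "Peru", "Chile", "Brazil"],
        "city": ["Lima", "Salvador", "Cusco", "Santiago", "Recife"],
        "avg_temp_c": [19.0, 25.5, 12.0, 14.5, 26.0],
    }
)

# Build a two-level index and sort it so label slices are well defined
toy_srt = toy.set_index(["country", "city"]).sort_index()

# Outer level: plain strings; both endpoints are included
outer = toy_srt.loc["Brazil":"Chile"]

# Inner level: tuples of (outer label, inner label)
inner = toy_srt.loc[("Brazil", "Salvador"):("Peru", "Cusco")]

print(outer)
print(inner)
```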

- **Slicing rows**

In [52]:
# set the index temperatures by country & city
temperatures_ind =  temperatures.set_index(["country","city"])

# Sort the index of temperatures_ind
temperatures_srt = temperatures_ind.sort_index()

# Subset rows from Pakistan to Russia
display(temperatures_srt.loc["Pakistan":"Russia"].head())

Unnamed: 0_level_0,Unnamed: 1_level_0,date,avg_temp_c
country,city,Unnamed: 2_level_1,Unnamed: 3_level_1
Pakistan,Faisalabad,2000-01-01,12.792
Pakistan,Faisalabad,2000-02-01,14.339
Pakistan,Faisalabad,2000-03-01,20.309
Pakistan,Faisalabad,2000-04-01,29.072
Pakistan,Faisalabad,2000-05-01,34.845


In [53]:
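# Try to subset rows from Lahore to Moscow.
# A plain string slices the outer (country) level, so this returns the countries
# that sort between "Lahore" and "Moscow", not the rows for those cities.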
display(temperatures_srt.loc["Lahore" : "Moscow"].head())

Unnamed: 0_level_0,Unnamed: 1_level_0,date,avg_temp_c
country,city,Unnamed: 2_level_1,Unnamed: 3_level_1
Mexico,Mexico,2000-01-01,12.694
Mexico,Mexico,2000-02-01,14.677
Mexico,Mexico,2000-03-01,17.376
Mexico,Mexico,2000-04-01,18.294
Mexico,Mexico,2000-05-01,18.562


In [54]:
# Subset rows from Pakistan, Lahore to Russia, Moscow
display(temperatures_srt.loc[("Pakistan", "Lahore"):("Russia", "Moscow")].head())

Unnamed: 0_level_0,Unnamed: 1_level_0,date,avg_temp_c
country,city,Unnamed: 2_level_1,Unnamed: 3_level_1
Pakistan,Lahore,2000-01-01,12.792
Pakistan,Lahore,2000-02-01,14.339
Pakistan,Lahore,2000-03-01,20.309
Pakistan,Lahore,2000-04-01,29.072
Pakistan,Lahore,2000-05-01,34.845


- **Slicing rows and columns**

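Pass two slices, separated by a comma, to slice both dimensions at once: `df.loc[row_first:row_last, col_first:col_last]`. A bare `:` keeps everything along that dimension.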

In [55]:
# Slice rows from India, Hyderabad to Iraq, Baghdad
display(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad")].head() )

Unnamed: 0_level_0,Unnamed: 1_level_0,date,avg_temp_c
country,city,Unnamed: 2_level_1,Unnamed: 3_level_1
India,Hyderabad,2000-01-01,23.779
India,Hyderabad,2000-02-01,25.826
India,Hyderabad,2000-03-01,28.821
India,Hyderabad,2000-04-01,32.698
India,Hyderabad,2000-05-01,32.438


In [56]:
# Slice columns from date to avg_temp_c
display(temperatures_srt.loc[: , "date":"avg_temp_c" ].head())

Unnamed: 0_level_0,Unnamed: 1_level_0,date,avg_temp_c
country,city,Unnamed: 2_level_1,Unnamed: 3_level_1
Afghanistan,Kabul,2000-01-01,3.326
Afghanistan,Kabul,2000-02-01,3.454
Afghanistan,Kabul,2000-03-01,9.612
Afghanistan,Kabul,2000-04-01,17.925
Afghanistan,Kabul,2000-05-01,24.658


In [57]:
# Slice in both directions at once
display(temperatures_srt.loc[ ("India", "Hyderabad"):("Iraq", "Baghdad"), "date":"avg_temp_c" ].head())

Unnamed: 0_level_0,Unnamed: 1_level_0,date,avg_temp_c
country,city,Unnamed: 2_level_1,Unnamed: 3_level_1
India,Hyderabad,2000-01-01,23.779
India,Hyderabad,2000-02-01,25.826
India,Hyderabad,2000-03-01,28.821
India,Hyderabad,2000-04-01,32.698
India,Hyderabad,2000-05-01,32.438


- **Slicing time series**
  
Slicing is particularly useful for time series, where filtering for a date range is a common task. Add the date column to the index, then use `.loc[]` to perform the subsetting. Keep your dates in ISO 8601 format: "`yyyy-mm-dd`" for year-month-day, "`yyyy-mm`" for year-month, and "`yyyy`" for year.

In [58]:
# Use Boolean conditions to subset temperatures for rows in 2010 and 2011
# (the upper bound is the last day of 2011, so any date in December is covered)
temperatures_bool = temperatures[(temperatures['date'] >= "2010-01-01") & (temperatures['date'] <= "2011-12-31")]
display(temperatures_bool.head(3))

Unnamed: 0,date,city,country,avg_temp_c
120,2010-01-01,Abidjan,Côte D'Ivoire,28.27
121,2010-02-01,Abidjan,Côte D'Ivoire,29.262
122,2010-03-01,Abidjan,Côte D'Ivoire,29.596


In [59]:
# Set date as the index and sort the index
temperatures_ind = temperatures.set_index("date").sort_index()

# Use .loc[] to subset temperatures_ind for rows in 2010 and 2011
display(temperatures_ind.loc['2010':'2011'].head(3))

Unnamed: 0_level_0,city,country,avg_temp_c
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2010-01-01,Faisalabad,Pakistan,11.81
2010-01-01,Melbourne,Australia,20.016
2010-01-01,Chongqing,China,7.921


In [60]:
# Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011
display(temperatures_ind.loc["2010-08":"2011-02"].head(3))

Unnamed: 0_level_0,city,country,avg_temp_c
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2010-08-01,Calcutta,India,30.226
2010-08-01,Pune,India,24.941
2010-08-01,Izmir,Turkey,28.352
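
Partial labels such as `"2010"` or `"2010-08"` are resolved as whole periods when the index is a `DatetimeIndex`. With an index of plain strings the comparison is lexicographic instead, so an end label like `"2011"` would sort before `"2011-01-01"`. The sketch below therefore assumes the dates are parsed with `pd.to_datetime` first; the data is a toy series with made-up values:

```python
import pandas as pd

# Toy monthly data with hypothetical values
toy = pd.DataFrame(
    {
        "date": ["2010-11-01", "2010-12-01", "2011-01-01", "2011-02-01", "2012-01-01"],
        "avg_temp_c": [10.0, 8.0, 7.5, 9.0, 7.0],
    }
)

# Parse the strings so that the index becomes a DatetimeIndex
toy["date"] = pd.to_datetime(toy["date"], format="%Y-%m-%d")
toy_ind = toy.set_index("date").sort_index()

# A bare year stands for the whole year, so the slice runs to the end of 2011
both_years = toy_ind.loc["2010":"2011"]

# Year-month labels work the same way: December 2010 through January 2011
winter = toy_ind.loc["2010-12":"2011-01"]

print(both_years)
print(winter)
```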


### Slicing with iloc

`.iloc` selects rows and columns by integer position. It accepts only positional arguments (integers, integer slices, or lists of integers), not labels. To select data by label or by condition, use `.loc` instead.

In [61]:
# Slice the first 22 rows (positions 0 to 21) and the first column (position 0)
display(temperatures.iloc[:22,:1].head(2))

Unnamed: 0,date
0,2000-01-01
1,2000-02-01


In [62]:
# Use slicing to get the first 5 rows
display(temperatures.iloc[:5,:].head(2) )

Unnamed: 0,date,city,country,avg_temp_c
0,2000-01-01,Abidjan,Côte D'Ivoire,27.293
1,2000-02-01,Abidjan,Côte D'Ivoire,27.685


In [63]:
# Select columns 3 and 4 (positions 2 and 3); the slice 2:4 is equivalent
display(temperatures.iloc[:, [2,3]].head(2))


Unnamed: 0,country,avg_temp_c
0,Côte D'Ivoire,27.293
1,Côte D'Ivoire,27.685


In [64]:

# Use slicing in both directions at once
display(temperatures.iloc[:5,[2,3]].head(2))

Unnamed: 0,country,avg_temp_c
0,Côte D'Ivoire,27.293
1,Côte D'Ivoire,27.685
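
The main practical difference between the two indexers is how slice endpoints are treated. A minimal sketch on a toy DataFrame (hypothetical values and row labels) compares them side by side:

```python
import pandas as pd

# Toy data with hypothetical values and letter labels for the rows
toy = pd.DataFrame(
    {
        "city": ["Lima", "Cusco", "Recife", "Salvador"],
        "avg_temp_c": [19.0, 12.0, 26.0, 25.5],
    },
    index=["a", "b", "c", "d"],
)

# .iloc: positions, stop value excluded (rows 0-1, column 0)
by_position = toy.iloc[0:2, 0:1]

# .loc: labels, both endpoints included (rows "a"-"b", both columns)
by_label = toy.loc["a":"b", "city":"avg_temp_c"]

# Two plain integers return a single value: 3rd row, 2nd column
single_value = toy.iloc[2, 1]

print(by_position.shape, by_label.shape, single_value)
```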


## Working with pivot tables

How to perform subsetting and calculations on pivot tables.

**Pivot temperature by city and year**

It's interesting to see how temperatures for each city change over time, but looking at every month results in a big table, which can be tricky to reason about. Instead, let's look at how temperatures change by year.

You can access the components of a date (year, month, and day) using code of the form:
- `dataframe["column"].dt.component`: the general pattern for any date component
- `dataframe["column"].dt.month`: the month component
- `dataframe["column"].dt.year`: the year component

The `.dt` accessor requires a datetime column. To convert a column to datetime, use:

- `pd.to_datetime(dataframe["column"], format='%Y-%m-%d')`

In [65]:
# Parse the 'date' column as datetime (year-month-day format)
temperatures["date"] = pd.to_datetime(temperatures['date'], format='%Y-%m-%d')

# Add a year column to temperatures
temperatures["year"]= temperatures["date"].dt.year
display(temperatures.head(2))

Unnamed: 0,date,city,country,avg_temp_c,year
0,2000-01-01,Abidjan,Côte D'Ivoire,27.293,2000
1,2000-02-01,Abidjan,Côte D'Ivoire,27.685,2000


In [67]:
# Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table(values="avg_temp_c", index=["country","city"], columns="year")

# See the result
display(temp_by_country_city_vs_year.head(2))

Unnamed: 0_level_0,year,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
country,city,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
Afghanistan,Kabul,15.822667,15.847917,15.714583,15.132583,16.128417,14.8475,15.7985,15.518,15.47925,15.093333,15.676,15.812167,14.510333,16.206125
Angola,Luanda,24.410333,24.427083,24.790917,24.867167,24.216167,24.414583,24.138417,24.241583,24.266333,24.325083,24.44025,24.15075,24.240083,24.553875
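
By default `.pivot_table()` aggregates with the mean, so each cell above is the average of that city's monthly values for the year. A minimal sketch on toy data (hypothetical values) shows this, and compares it with the equivalent `.groupby()` plus `.unstack()`:

```python
import pandas as pd

# Toy data with hypothetical values: several readings per city and year
toy = pd.DataFrame(
    {
        "city": ["Lima", "Lima", "Lima", "Cusco", "Cusco"],
        "year": [2000, 2000, 2001, 2000, 2001],
        "avg_temp_c": [18.0, 20.0, 19.0, 12.0, 13.0],
    }
)

# Rows are cities, columns are years, cells hold the mean of avg_temp_c
by_city_year = toy.pivot_table(values="avg_temp_c", index="city", columns="year")

# The same table built by grouping, averaging, and moving year to the columns
same_by_groupby = toy.groupby(["city", "year"])["avg_temp_c"].mean().unstack()

print(by_city_year)
```

Lima's two readings for 2000 (18.0 and 20.0) collapse into a single cell holding their mean.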


**Subsetting pivot tables**

A pivot table is just a DataFrame with sorted indexes, so we can use `.loc[]` to slice the pivot table.

In [68]:
# Subset for Egypt to India
display(temp_by_country_city_vs_year.loc["Egypt": "India"].head(1))

Unnamed: 0_level_0,year,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
country,city,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
Egypt,Alexandria,20.7445,21.454583,21.456167,21.221417,21.064167,21.082333,21.148167,21.50775,21.739,21.6705,22.459583,21.1815,21.552583,21.4385


In [69]:
# Subset for Egypt, Cairo to India, Delhi
display(temp_by_country_city_vs_year.loc[("Egypt", "Cairo") :("India", "Delhi")].head(1))

Unnamed: 0_level_0,year,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013
country,city,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
Egypt,Cairo,21.486167,22.330833,22.414083,22.1705,22.081917,22.0065,22.05,22.361,22.6445,22.625,23.71825,21.986917,22.48425,22.907


In [70]:
# Subset in both directions at once
display(temp_by_country_city_vs_year.loc[ ("Egypt", "Cairo") :("India", "Delhi"), 2005:2010 ].head(1) )

Unnamed: 0_level_0,year,2005,2006,2007,2008,2009,2010
country,city,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Egypt,Cairo,22.0065,22.05,22.361,22.6445,22.625,23.71825


**Calculating on a pivot table**

Pivot tables are filled with summary statistics, but they are only a first step to finding something insightful. Often you'll need to perform further calculations on them. A common task is finding the rows or columns where the highest or lowest value occurs.

In [71]:
# Get the worldwide mean temp by year
mean_temp_by_year = temp_by_country_city_vs_year.mean(axis="index")
display(mean_temp_by_year.head(3))

year
2000    19.506243
2001    19.679352
2002    19.855685
dtype: float64

In [72]:

# Filter for the year that had the highest mean temp
display(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max() ]) 


year
2013    20.312285
dtype: float64

In [73]:

# Get the mean temp by city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis= "columns")
display(mean_temp_by_city.head(3))

country      city     
Afghanistan  Kabul        15.541955
Angola       Luanda       24.391616
Australia    Melbourne    14.275411
dtype: float64

In [74]:
# Filter for the city that had the lowest mean temp
display(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min() ])

country  city  
China    Harbin    4.876551
dtype: float64
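
When only the label of the extreme value is needed, `.idxmax()` and `.idxmin()` are a possible alternative to the Boolean filters used above. The sketch below uses a toy pivot table with made-up values; note that these methods return only the first label on a tie, whereas the Boolean filter keeps all tied rows:

```python
import pandas as pd

# Toy pivot table with hypothetical values: cities as rows, years as columns
toy_pivot = pd.DataFrame(
    {2000: [10.0, 20.0], 2001: [12.0, 26.0]},
    index=pd.Index(["Harbin", "Lima"], name="city"),
)

# axis="index" averages down the rows: one value per year
mean_by_year = toy_pivot.mean(axis="index")

# axis="columns" averages across the columns: one value per city
mean_by_city = toy_pivot.mean(axis="columns")

# Labels of the extreme values
hottest_year = mean_by_year.idxmax()
coldest_city = mean_by_city.idxmin()

print(hottest_year, coldest_city)
```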