# Introduction to pandas

## Why use pandas?

*Addresses some shortcomings of NumPy:*

1.  Data is organized in *variables* and *observations*

2.  Each variable is permitted to have a *different* data type (integers, floats, strings, ...)

3.  Can select observations based on *labels* (e.g., time or date)

4.  Supports aggregation & reduction functions applied to *subsets* of data

5.  Many convenient data import / export functions

## Why not?

1.  NumPy is faster for low-level computing on homogeneous data
2.  Pandas can consume lots of memory
3.  Pandas can be slow with large data sets (millions of observations)

## Resources

- [pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)


***
## Creating pandas data structures

Pandas has two main data structures:

1.  [`Series`](https://pandas.pydata.org/docs/reference/series.html): 
    observations of a *single* variable.
2.  [`DataFrame`](https://pandas.pydata.org/docs/reference/frame.html): 
    container for *several* variables.

*Example: Create Series from 1-dimensional NumPy array*

In [2]:
# Need to import pandas before usage
import pandas as pd

# Import NumPy to create some demo data
import numpy as np

In [7]:
# Create some data with NumPy
data = np.arange(5, 10)

In [4]:
data

array([5, 6, 7, 8, 9])

In [8]:
series = pd.Series(data)

In [9]:
series

0    5
1    6
2    7
3    8
4    9
dtype: int64

*Example: Create DataFrame from NumPy array*

-   We can create 2-dimensional arrays from 1-dimensional ones with 
    [`reshape()`](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html)

In [12]:
# Want to create a 5-by-3 matrix of integers
data = np.arange(15).reshape((5, 3))

In [13]:
data

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])

In [19]:
df = pd.DataFrame(data, columns=['A','B','C'])

In [17]:
df

Unnamed: 0,A,B,C
0,0,1,2
1,3,4,5
2,6,7,8
3,9,10,11
4,12,13,14


*Example: Create DataFrame with heterogeneous data from dictionary*

In [20]:
import numpy as np
import pandas as pd

# Names (strings)
names = ['Alice', 'Bob']

# Birth dates (datetime objects)
bdates = pd.to_datetime(['1985-01-01', '1997-05-12'])

# Incomes (floats)
incomes = np.array([600000, np.nan])         # code missing income as NaN

In [21]:
df = pd.DataFrame({'name': names, 'birthday': bdates, 'income': incomes})

In [22]:
df

Unnamed: 0,name,birthday,income
0,Alice,1985-01-01,600000.0
1,Bob,1997-05-12,


<div class="alert alert-info">
<h3> Your turn</h3>

Create a pandas <tt>Series</tt> which contains the characters <tt>'a'</tt>, <tt>'b'</tt>, and <tt>'c'</tt>.

</div>

In [28]:
pd.Series(['a','b','c'])

0    a
1    b
2    c
dtype: object

***
## Importing data

### Loading data with NumPy & its limitations (optional)

-   See final lecture notebook if you are interested

***
### Loading data with Pandas

The most important input/output functions are:

-   [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), 
    [`to_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html): 
    Read or write CSV text files.
-   [`read_fwf()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_fwf.html): 
    Read data with fixed field widths, i.e., text data
    that does not use delimiters to separate fields.
-   [`read_excel()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html), 
    [`to_excel()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html): 
    Read or write Excel spreadsheets.
-   [`read_stata()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_stata.html), 
    [`to_stata()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_stata.html): 
    Read or write Stata's `.dta` files.
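
The read/write pairs listed above mirror each other. A minimal round-trip sketch, assuming made-up demo values and an in-memory text buffer (`io.StringIO`) in place of a file on disk:

```python
import io
import pandas as pd

# Small demo table (made-up values, not the FRED data)
df_out = pd.DataFrame({'Year': [2000, 2001], 'Value': [1.5, 2.5]})

# Write CSV text to an in-memory buffer instead of a file
buffer = io.StringIO()
df_out.to_csv(buffer, index=False)    # index=False: do not write the row index

# Rewind the buffer and read the CSV text back into a DataFrame
buffer.seek(0)
df_in = pd.read_csv(buffer)
```

With a real file, the buffer would simply be replaced by a file name in both calls.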

*Example: Load data using `read_csv()`*

In [29]:
# Path to the local data/ directory (relative to this notebook)
DATA_PATH = '../../data'

filename = f'{DATA_PATH}/FRED/FRED_annual.csv'

df = pd.read_csv(filename, sep=',')

In [30]:
df

Unnamed: 0,Year,GDP,CPI,UNRATE,FEDFUNDS,INFLATION
0,1954,2877.7,26.9,5.6,1.0,
1,1955,3083.0,26.8,4.4,1.8,-0.4
2,1956,3148.8,27.2,4.1,2.7,1.5
3,1957,3215.1,28.1,4.3,3.1,3.3
4,1958,3191.2,28.9,6.8,1.6,2.8
...,...,...,...,...,...,...
66,2020,20267.6,258.9,8.1,0.4,1.3
67,2021,21494.8,271.0,5.4,0.1,4.7
68,2022,22034.8,292.6,3.6,1.7,8.0
69,2023,22671.1,304.7,3.6,5.0,4.1


<div class="alert alert-info">
<h3> Your turn</h3>
Use the pandas functions listed above to import data from the following files located in the <TT>data</TT> folder:
<ol>
    <li>titanic.csv</li>
    <li>FRED/FRED_annual.xlsx</li>
</ol>

To load Excel files, you need to have the package <TT>openpyxl</TT> installed.
</div>

In [None]:
filename = f'{DATA_PATH}/titanic.csv'

df = pd.read_csv(filename)

In [34]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,211536,13.0000,,S
887,888,1,1,"Graham, Miss Margaret Edith",female,19.0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,111369,30.0000,C148,C


In [35]:
filename2 = f'{DATA_PATH}/FRED/FRED_annual.xlsx'

df2 = pd.read_excel(filename2)

In [36]:
df2

Unnamed: 0,Year,GDP,CPI,UNRATE,FEDFUNDS,INFLATION
0,1954,2877.7,26.9,5.6,1.0,
1,1955,3083.0,26.8,4.4,1.8,-0.371747
2,1956,3148.8,27.2,4.1,2.7,1.492537
3,1957,3215.1,28.1,4.3,3.1,3.308824
4,1958,3191.2,28.9,6.8,1.6,2.846975
...,...,...,...,...,...,...
65,2019,20715.7,255.7,3.7,2.2,1.831939
66,2020,20267.6,258.8,8.1,0.4,1.212358
67,2021,21494.8,271.0,5.4,0.1,4.714065
68,2022,22034.8,292.6,3.6,1.7,7.970480


***
## Viewing data

Methods for inspecting (parts of) a DataFrame:

- [`info()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html): print information about observation count, columns, and data types
- [`head()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html): print the first few rows
- [`tail()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html): print the last few rows
- [`describe()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html): print summary statistics for *numerical* data
- [`value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.value_counts.html): tabulate observation counts for categorical data



*Example: Load and view Titanic data set*

Columns present in the file `titanic.csv`:

1.  `PassengerId`
2.  `Survived`: indicator whether the person survived
3.  `Pclass`: accommodation class (first, second, third)
4.  `Name`: Name of passenger (last name, first name)
5.  `Sex`: `male` or `female`
6.  `Age`
7.  `Ticket`: Ticket number
8.  `Fare`: Fare in pounds
9.  `Cabin`: Deck + cabin number
10. `Embarked`: Port at which passenger embarked:
    `C` - Cherbourg, `Q` - Queenstown, `S` - Southampton

In [37]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   Ticket       891 non-null    object 
 7   Fare         891 non-null    float64
 8   Cabin        204 non-null    object 
 9   Embarked     889 non-null    object 
dtypes: float64(2), int64(3), object(5)
memory usage: 69.7+ KB


In [39]:
df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,STON/O2. 3101282,7.925,,S


In [40]:
df.tail(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked
888,889,0,3,"Johnston, Miss Catherine Helen ""Carrie""",female,,W./C. 6607,23.45,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,111369,30.0,C148,C
890,891,0,3,"Dooley, Mr. Patrick",male,32.0,370376,7.75,,Q


In [41]:
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,Fare
count,891.0,891.0,891.0,714.0,891.0
mean,446.0,0.383838,2.308642,29.699118,32.204208
std,257.353842,0.486592,0.836071,14.526497,49.693429
min,1.0,0.0,1.0,0.42,0.0
25%,223.5,0.0,2.0,20.125,7.9104
50%,446.0,0.0,3.0,28.0,14.4542
75%,668.5,1.0,3.0,38.0,31.0
max,891.0,1.0,3.0,80.0,512.3292


In [42]:
df['Sex'].value_counts()

Sex
male      577
female    314
Name: count, dtype: int64

<div class="alert alert-info">
<h3> Your turn</h3>
Using the Titanic data set, tabulate the number of passengers by the port in which they boarded the ship
(variable <tt>Embarked</tt>). How many observations have missing values for this variable?
</div>

In [48]:
df['Embarked'].value_counts()

Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

In [51]:
len(df) - df['Embarked'].value_counts().sum()

np.int64(2)

In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   Ticket       891 non-null    object 
 7   Fare         891 non-null    float64
 8   Cabin        204 non-null    object 
 9   Embarked     889 non-null    object 
dtypes: float64(2), int64(3), object(5)
memory usage: 69.7+ KB


***
## Indexing

Pandas supports two types of indexing:

1.  Indexing by position (same as Python containers and NumPy arrays)
2.  Indexing by label, i.e., by the values assigned to the row or column *index*.
    
Pandas indexing is performed either by using brackets `[]`, or by using
`.loc[]` for label indexing, or `.iloc[]` for positional indexing.

Indexing via `[]` can be somewhat confusing:

-   specifying `df['name']` returns the column `name` as a `Series` object.
-   specifying a range such as `df[5:10]` returns the *rows*
    associated with the *positions* 5,...,9.

    **Recommendation:** Avoid this form; there are less confusing ways to select rows.
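
The two behaviours of `[]` can be seen side by side in the following sketch. The data and the names `df_demo`, `col`, `rows`, and `rows_iloc` are made up for illustration:

```python
import pandas as pd

# Hypothetical demo data with string row labels
df_demo = pd.DataFrame({'name': ['Alice', 'Bob', 'Carol']}, index=['x', 'y', 'z'])

col = df_demo['name']          # a label inside [] selects a *column*
rows = df_demo[0:2]            # an integer slice inside [] selects *rows* by position

# The explicit alternative for positional row selection
rows_iloc = df_demo.iloc[0:2]
```

Both `rows` and `rows_iloc` contain the first two rows, but the `.iloc[]` version states its intent.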

In [53]:
filename = f'{DATA_PATH}/titanic.csv'

df = pd.read_csv(filename)

*Example: Selecting a single column*

-   Select column `'Name'` from Titanic data set

In [54]:
df['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                  Heikkinen, Miss Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                          Graham, Miss Margaret Edith
888              Johnston, Miss Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

*Example: Selecting multiple columns*

-   Select columns `'Name'` and `'Sex'` from Titanic data set
-   Need to specify multiple columns as `list`

In [59]:
df[['Name','Sex']]

Unnamed: 0,Name,Sex
0,"Braund, Mr. Owen Harris",male
1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female
2,"Heikkinen, Miss Laina",female
3,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female
4,"Allen, Mr. William Henry",male
...,...,...
886,"Montvila, Rev. Juozas",male
887,"Graham, Miss Margaret Edith",female
888,"Johnston, Miss Catherine Helen ""Carrie""",female
889,"Behr, Mr. Karl Howell",male


***
### Creating and manipulating indices


Three main methods to create/manipulate indices:

1.   Create a new `Series` or `DataFrame` object with a custom index
    using the `index` argument.
2.   [`set_index(keys=['column1', ...])`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html)
    uses the values of `column1`
    and optionally additional columns as indices, discarding the current index.
3.   [`reset_index()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) 
    resets the index to its default value, a sequence
    of increasing integers starting at 0.

#### Creating custom indices

*Example: Create `Series` with custom index*

In [60]:
series = pd.Series(np.arange(3), index=['a','b','c'])
series

a    0
b    1
c    2
dtype: int64

#### Manipulating indices


-   By default, `set_index()` and `reset_index()` return a *new* object and leave the original unchanged, unless `inplace=True` is specified.

*Example: Set DataFrame index from column*

-   Use the `set_index()` method
-   Optionally specify `append=True` to add as *additional* index levels

In [61]:
# Create demo DataFrame
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': ['a', 'b', 'c']})

In [69]:
df.set_index('B', inplace=True)

In [70]:
df

B,A
a,10
b,20
c,30
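
The `append=True` option mentioned above is not shown in the example. A minimal sketch, assuming the same demo data under the hypothetical name `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [10, 20, 30], 'B': ['a', 'b', 'c']})

# Replace the default index by column 'B'
df_b = df_demo.set_index('B')

# append=True keeps the existing index and adds 'B' as a second index level
df_multi = df_demo.set_index('B', append=True)
```

Here `df_multi` has a two-level index: the original integers and the values of `B`.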



*Example: Reset DataFrame index*

-   Use the `reset_index()` method
-   Optionally specify `drop=True`, otherwise the old index is added as a column to the DataFrame

In [74]:
df.reset_index(drop=True)

Unnamed: 0,A
0,10
1,20
2,30
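
For comparison, a sketch of `reset_index()` with and without `drop=True`, using a made-up DataFrame `df_demo` whose index is named `B`:

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [10, 20, 30]},
                       index=pd.Index(['a', 'b', 'c'], name='B'))

kept = df_demo.reset_index()               # old index becomes column 'B'
dropped = df_demo.reset_index(drop=True)   # old index is discarded
```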


<div class="alert alert-info">
<h3> Your turn</h3>
Read in the following data files from the <TT>data/FRED</TT> folder and manipulate the dataframe index:
<ol>
    <li>Read in the file <TT>FRED_annual.csv</TT> and set the column <TT>Year</TT> as the index.</li>
    <li>Read in the file <TT>FRED_monthly.csv</TT> and set the columns <TT>Year</TT> and <TT>Month</TT> as the index</li>
</ol>
Perform the tasks using <TT>inplace=False</TT> and <TT>inplace=True</TT>. What's the difference?

Restore the original (default) index after you are done.
</div>

In [100]:
fn2 = f'{DATA_PATH}/FRED/FRED_annual.csv'
filename3 = f'{DATA_PATH}/FRED/FRED_monthly.csv'

df_annual = pd.read_csv(fn2)
df_monthly = pd.read_csv(filename3)

In [101]:
df_annual.set_index('Year', inplace=True)

In [102]:
df_monthly.set_index(['Year','Month'], inplace=True)
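
The difference between `inplace=False` and `inplace=True` can be sketched as follows. The data is made up and the names `df_demo`, `df_new`, `result`, and `df_restored` are illustrative only:

```python
import pandas as pd

df_demo = pd.DataFrame({'Year': [2000, 2001], 'Month': [1, 2], 'Value': [1.0, 2.0]})

# inplace=False (the default): df_demo is unchanged, a new object is returned
df_new = df_demo.set_index(['Year', 'Month'])

# inplace=True: df_demo itself is modified and the method returns None
result = df_demo.set_index(['Year', 'Month'], inplace=True)

# Restore the default integer index; the index levels become columns again
df_restored = df_demo.reset_index()
```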

***
### Selecting elements

Recommended rules for indexing:

1.  Use `df['name']` only to select *columns* and nothing else
2.  Use [`.loc[]`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) to select by label
3.  Use [`.iloc[]`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) to select by position

*Demo data set used for this section:*

In [104]:
# Create demo data with 3 columns and 5 rows

# Column labels
columns = ['X', 'Y', 'Z']
# Row labels
rows = ['a', 'b', 'c', 'd', 'e']

values = np.arange(len(rows))

# Create data dictionary using a list comprehension
data = {col: [f'{col}{val}' for val in values] for col in columns}

# Create DataFrame from dictionary
df = pd.DataFrame(data, index=rows)

In [105]:
df

Unnamed: 0,X,Y,Z
a,X0,Y0,Z0
b,X1,Y1,Z1
c,X2,Y2,Z2
d,X3,Y3,Z3
e,X4,Y4,Z4


**Selection by label**

-   Use `.loc[]` to select rows and/or columns *by label*
-   Can use *slicing* where last element is *included*

In [110]:
df.loc['b':'e', ['X', 'Y']]

Unnamed: 0,X,Y
b,X1,Y1
c,X2,Y2
d,X3,Y3
e,X4,Y4


**Selection by position**

-   Use `.iloc[]` to select rows and/or columns *by position*

In [113]:
df.iloc[0]

X    X0
Y    Y0
Z    Z0
Name: a, dtype: object
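
Slices behave differently in `.loc[]` and `.iloc[]`: label slices include the end point, position slices exclude it. A small sketch with a cut-down, hypothetical version of the demo data:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'X': ['X0', 'X1', 'X2', 'X3', 'X4'], 'Y': ['Y0', 'Y1', 'Y2', 'Y3', 'Y4']},
    index=['a', 'b', 'c', 'd', 'e'],
)

by_label = df_demo.loc['b':'d', 'X']   # label slice: includes 'd'
by_pos = df_demo.iloc[1:3, 0]          # position slice: excludes position 3
```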

**Boolean indexing**

-   Select elements based on whether some condition is true
-   Works with `[]` and with `.loc[]`; `.iloc[]` accepts a boolean array or list (e.g., `condition.to_numpy()`), but not a boolean `Series`

In [115]:
filename = f'{DATA_PATH}/titanic.csv'

df = pd.read_csv(filename)

*Example: Boolean indexing with Titanic data*

- Select all rows of passengers who embarked in Southampton (`'Embarked'` equals `'S'`)

In [119]:
df[df['Embarked'] == 'S']

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.2500,,S
2,3,1,3,"Heikkinen, Miss Laina",female,26.0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.0500,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,17463,51.8625,E46,S
...,...,...,...,...,...,...,...,...,...,...
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,C.A./SOTON 34068,10.5000,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,SOTON/OQ 392076,7.0500,,S
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,211536,13.0000,,S
887,888,1,1,"Graham, Miss Margaret Edith",female,19.0,112053,30.0000,B42,S


*Example: Multiple conditions with logical and/or*

- Select all rows of *male* passengers (`'Sex'` equals `'male'`) who embarked in Southampton (`'Embarked'` equals `'S'`)

In [122]:
condition = (df['Embarked'] == 'S') & (df['Sex'] == 'male')

In [123]:
df[condition]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,A/5 21171,7.2500,,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,373450,8.0500,,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master Gosta Leonard",male,2.0,349909,21.0750,,S
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,A/5. 2151,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...
878,879,0,3,"Laleff, Mr. Kristo",male,,349217,7.8958,,S
881,882,0,3,"Markun, Mr. Johann",male,33.0,349257,7.8958,,S
883,884,0,2,"Banfield, Mr. Frederick James",male,28.0,C.A./SOTON 34068,10.5000,,S
884,885,0,3,"Sutehall, Mr. Henry Jr",male,25.0,SOTON/OQ 392076,7.0500,,S
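
The example above combines conditions with `&` (logical and). A sketch of the remaining operators, `|` (or) and `~` (not), on made-up data:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'female'],
    'Embarked': ['S', 'C', 'Q', 'S'],
})

# Logical OR: rows that satisfy at least one of the two conditions
either = df_demo[(df_demo['Embarked'] == 'S') | (df_demo['Sex'] == 'male')]

# Logical NOT: rows that do not satisfy the condition
not_s = df_demo[~(df_demo['Embarked'] == 'S')]
```

The parentheses around each comparison are needed because `&` and `|` bind more tightly than `==`.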


*Example: Using `isin()`*

- Select all rows of passengers who embarked either in Southampton or Queenstown (`'Embarked'` equals `'S'` or `'Q'`)

In [124]:
df['Embarked'].isin(('S', 'Q'))

0       True
1      False
2       True
3       True
4       True
       ...  
886     True
887     True
888     True
889    False
890     True
Name: Embarked, Length: 891, dtype: bool
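
`isin()` by itself only returns a boolean mask. To obtain the matching rows, the mask is passed to `[]`, as in this sketch with made-up data:

```python
import pandas as pd

df_demo = pd.DataFrame({'Embarked': ['S', 'C', 'Q', None]})

# Boolean mask: True where the value is one of the listed ports
mask = df_demo['Embarked'].isin(['S', 'Q'])

# Use the mask to select the matching rows
subset = df_demo[mask]
```

Missing values do not match any of the listed values, so the last row is excluded.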

*Example: Using `query()`*

- Select all rows of passengers who embarked in Southampton (`'Embarked'` equals `'S'`) and are older than 70

In [125]:
df.query('Embarked == "S" & Age > 70')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked
630,631,1,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,27042,30.0,A23,S
851,852,0,3,"Svensson, Mr. Johan",male,74.0,347060,7.775,,S
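
Inside a `query()` string, Python variables can be referenced with the `@` prefix. The sketch below uses made-up data and a hypothetical variable `min_age`, and compares the result to the equivalent boolean mask:

```python
import pandas as pd

df_demo = pd.DataFrame({
    'Embarked': ['S', 'S', 'C', 'Q'],
    'Age': [80.0, 35.0, 71.0, None],
})

min_age = 70   # hypothetical threshold

# Column names are used directly; @min_age refers to the Python variable
by_query = df_demo.query('Embarked == "S" and Age > @min_age')

# Same selection written as a boolean mask
by_mask = df_demo[(df_demo['Embarked'] == 'S') & (df_demo['Age'] > min_age)]
```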


<div class="alert alert-info">
<h3> Your turn</h3>
Load the Titanic passenger data set <TT>data/titanic.csv</TT> and select the following subsets of data:
<ol>
    <li>Select all passengers with passenger IDs from 10 to 20</li>
    <li>Select the 10th to 20th (inclusive) row of the dataframe</li>
    <li>Using <TT>query()</TT>, select the sub-sample of female passengers aged 30 to 40. Display only the columns <TT>Name</TT>, <TT>Age</TT>, and <TT>Sex</TT> (in that order)</li>
    <li>Repeat the last exercise without using <TT>query()</TT></li>
    <li>Select all men who embarked in Queenstown or Cherbourg</li>
</ol>
</div>

In [129]:
df.iloc[10:20]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked
10,11,1,3,"Sandstrom, Miss Marguerite Rut",female,4.0,PP 9549,16.7,G6,S
11,12,1,1,"Bonnell, Miss Elizabeth",female,58.0,113783,26.55,C103,S
12,13,0,3,"Saundercock, Mr. William Henry",male,20.0,A/5. 2151,8.05,,S
13,14,0,3,"Andersson, Mr. Anders Johan",male,39.0,347082,31.275,,S
14,15,0,3,"Vestrom, Miss Hulda Amanda Adolfina",female,14.0,350406,7.8542,,S
15,16,1,2,"Hewlett, Mrs. (Mary D Kingcome)",female,55.0,248706,16.0,,S
16,17,0,3,"Rice, Master Eugene",male,2.0,382652,29.125,,Q
17,18,1,2,"Williams, Mr. Charles Eugene",male,,244373,13.0,,S
18,19,0,3,"Vander Planke, Mrs. Julius (Emelia Maria Vande...",female,31.0,345763,18.0,,S
19,20,1,3,"Masselmani, Mrs. Fatima",female,,2649,7.225,,C


In [151]:
df.query('Sex == "female" & Age == 30')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked
79,80,1,3,"Dowdell, Miss Elizabeth",female,30.0,364516,12.475,,S
257,258,1,1,"Cherry, Miss Gladys",female,30.0,110152,86.5,B77,S
309,310,1,1,"Francatelli, Miss Laura Mabel",female,30.0,PC 17485,56.9292,E36,C
322,323,1,2,"Slayter, Miss Hilda Mary",female,30.0,234818,12.35,,Q
520,521,1,1,"Perreault, Miss Anne",female,30.0,12749,93.5,B73,S
534,535,0,3,"Cacic, Miss Marija",female,30.0,315084,8.6625,,S
537,538,1,1,"LeRoy, Miss Bertha",female,30.0,PC 17761,106.425,,C
726,727,1,2,"Renouf, Mrs. Peter Henry (Lillian Jefferys)",female,30.0,31027,21.0,,S
747,748,1,2,"Sinkkonen, Miss Anna",female,30.0,250648,13.0,,S
799,800,0,3,"Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go...",female,30.0,345773,24.15,,S


In [None]:
df.loc[10]


***
## Working with time series data

-   Pandas indices can be date or datetime data types
-   Use [`date_range()`](https://pandas.pydata.org/docs/reference/api/pandas.date_range.html) to create a range of dates
-   Use [`to_datetime()`](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html) to convert existing data to datetime format

*Example: Creating a date index*

-   Create a demo data set of daily observations for the first 3 months of 2024

In [None]:
# Start and end dates used for demo data set
start = '2024-01-01'
end = '2024-03-31'
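
One possible way to build the demo data set is sketched below. The values (a simple counter) and the names `index` and `ts` are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

start = '2024-01-01'
end = '2024-03-31'

# One timestamp per calendar day, both end points included
index = pd.date_range(start=start, end=end, freq='D')

# Hypothetical demo values: 0, 1, 2, ... (one per day)
ts = pd.Series(np.arange(len(index)), index=index)
```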

*Example: Select particular date*

-   Select observation from January 1, 2024

*Example: Select date range*

- Select first 5 days in January 2024

*Example: Use a partial index*

- Select all of January 2024
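
The three selections above can be sketched with `.loc[]` and date strings. The series `ts` is a made-up daily counter, not data from the course files:

```python
import numpy as np
import pandas as pd

# Hypothetical daily series for the first 3 months of 2024
index = pd.date_range(start='2024-01-01', end='2024-03-31', freq='D')
ts = pd.Series(np.arange(len(index)), index=index)

one_day = ts.loc['2024-01-01']                   # single observation
first_five = ts.loc['2024-01-01':'2024-01-05']   # label slice, end point included
january = ts.loc['2024-01']                      # partial date string: whole month
```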

### Lags, differences, and other useful transformations

Methods to shift/difference observations along time dimension:

- [`shift()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html): creates leads/lags
- [`diff()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.diff.html): computes absolute differences over given period
- [`pct_change()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pct_change.html): computes relative differences over given period
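
A minimal sketch of the three methods on a made-up annual series (the name `cpi` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical price index for three consecutive years
cpi = pd.Series([100.0, 102.0, 105.06], index=[2000, 2001, 2002])

lag = cpi.shift(1)                 # previous year's value (NaN for the first year)
change = cpi.diff()                # same as cpi - cpi.shift(1)
growth = cpi.pct_change() * 100    # relative change, scaled to percent
```

The first observation of each result is missing because it has no predecessor.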

<div class="alert alert-info">
<h3> Your turn</h3>
Use the data from the <TT>data/FRED</TT> folder to perform the following task:
<ol>
    <li>Read in the file <TT>FRED_annual.csv</TT> and set the column <TT>Year</TT> as the index.</li>
    <li>Compute annual inflation as the percentage change of the consumer price index (column <tt>CPI</tt>).</li>
</ol>
</div>