## Examining and selecting data from DataFrames

In [1]:
import pandas as pd
from IPython.display import display
from openbb import obb

In [2]:
obb.user.preferences.output_type = "dataframe"

Fetches historical price data for the equity "AAPL" starting from 2021-01-01 using the "yfinance" provider and stores it in 'df'

In [3]:
df = obb.equity.price.historical("AAPL", start_date="2021-01-01", provider="yfinance")

In [4]:
display(df)

Displays the first 5 rows of 'df'

In [5]:
ddf = df.head(5)

In [6]:
display(ddf)

Displays the last 5 rows of 'df'

In [7]:
ddf = df.tail(5)

In [8]:
display(ddf)

Displays the values of 'df'

In [9]:
display(df.values)

array([[1.33520004e+02, 1.33610001e+02, 1.26760002e+02, 1.29410004e+02,
        1.43301900e+08, 0.00000000e+00],
       [1.28889999e+02, 1.31740005e+02, 1.28429993e+02, 1.31009995e+02,
        9.76649000e+07, 0.00000000e+00],
       [1.27720001e+02, 1.31050003e+02, 1.26379997e+02, 1.26599998e+02,
        1.55088000e+08, 0.00000000e+00],
       ...,
       [2.34119995e+02, 2.39860001e+02, 2.34009995e+02, 2.39360001e+02,
        4.54861000e+07, 0.00000000e+00],
       [2.38669998e+02, 2.40789993e+02, 2.37210007e+02, 2.37589996e+02,
        5.56583000e+07, 0.00000000e+00],
       [2.47190002e+02, 2.47190002e+02, 2.33440002e+02, 2.36000000e+02,
        1.00959800e+08, 0.00000000e+00]], shape=(1025, 6))
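
As a side note, the pandas documentation recommends `DataFrame.to_numpy()` over `.values` for new code. A minimal sketch on a small made-up frame (the `sample` name and its numbers are illustrative and are not part of the AAPL data):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for 'df': two rows of made-up prices.
sample = pd.DataFrame(
    {"open": [10.0, 11.0], "close": [10.5, 10.8], "volume": [1000, 1200]},
    index=pd.to_datetime(["2021-01-04", "2021-01-05"]),
)

arr = sample.to_numpy()  # same data that sample.values returns

print(arr.shape)  # (2, 3)
print(arr.dtype)  # float64: the integer 'volume' column is upcast to a common dtype
```

This upcasting is why 'volume' appears in scientific notation in the array above.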

Displays the statistical summary of 'df'

In [10]:
ddf = df.describe()

In [11]:
display(ddf)

Transposes 'df' and displays it

In [12]:
ddf = df.T

In [13]:
display(ddf)

Displays the column names of 'df'

In [14]:
display(df.columns)

Index(['open', 'high', 'low', 'close', 'volume', 'dividend'], dtype='object')

Updates the column names of 'df' (the list must contain exactly one name per column, and 'df' has six columns)

In [15]:
df.columns = [
    "open",
    "high",
    "low",
    "close",
    "volume",
    "dividend",
]
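
Assigning to `df.columns` requires exactly one name per existing column; otherwise pandas raises a `ValueError` with a length-mismatch message. When only some names need to change, `DataFrame.rename` with a mapping may be a safer alternative. A minimal sketch on a made-up frame (`sample` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical three-column frame standing in for 'df'.
sample = pd.DataFrame({"open": [10.0], "close": [10.5], "dividend": [0.0]})

# Rename only the columns named in the mapping; the others are left untouched.
renamed = sample.rename(columns={"dividend": "dividends"})
print(list(renamed.columns))

# Assigning a list with the wrong number of names raises ValueError
# and leaves the original column names in place.
try:
    sample.columns = ["open", "close", "dividends", "splits"]
except ValueError as err:
    print("ValueError:", err)

print(list(sample.columns))
```

`rename` returns a new DataFrame, so 'sample' keeps its original names unless the result is assigned back.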

In [16]:
display(df)

date,open,high,low,close,volume,dividend
2021-01-04,133.520004,133.610001,126.760002,129.410004,143301900,0.0
2021-01-05,128.889999,131.740005,128.429993,131.009995,97664900,0.0
2021-01-06,127.720001,131.050003,126.379997,126.599998,155088000,0.0
2021-01-07,128.360001,131.630005,127.860001,130.919998,109578200,0.0
2021-01-08,132.429993,132.630005,130.229996,132.050003,105158200,0.0
...,...,...,...,...,...,...
2025-01-27,224.020004,232.149994,223.979996,229.860001,94863400,0.0
2025-01-28,230.850006,240.190002,230.809998,238.259995,75707600,0.0
2025-01-29,234.119995,239.860001,234.009995,239.360001,45486100,0.0
2025-01-30,238.669998,240.789993,237.210007,237.589996,55658300,0.0


Accesses the 'close' column using two different methods

In [17]:
df["close"]
df.close

date
2021-01-04    129.410004
2021-01-05    131.009995
2021-01-06    126.599998
2021-01-07    130.919998
2021-01-08    132.050003
                 ...    
2025-01-27    229.860001
2025-01-28    238.259995
2025-01-29    239.360001
2025-01-30    237.589996
2025-01-31    236.000000
Name: close, Length: 1025, dtype: float64
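
The two access styles are not always interchangeable. Attribute access only works when the column name is a valid Python identifier that does not collide with an existing DataFrame attribute or method. A short sketch with made-up column names (`sample`, "count", and "adj close" are illustrative assumptions):

```python
import pandas as pd

# Hypothetical frame with one ordinary column and two awkward column names.
sample = pd.DataFrame(
    {"close": [10.5, 10.8], "count": [3, 4], "adj close": [10.4, 10.7]}
)

# For an ordinary name, both styles return the same Series.
print(sample["close"].equals(sample.close))  # True

# 'count' collides with DataFrame.count, so attribute access returns the method.
print(callable(sample.count))  # True

# Bracket access always returns the column, including names containing spaces.
print(sample["count"].tolist())  # [3, 4]
print(sample["adj close"].tolist())  # [10.4, 10.7]
```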

Slices the first three rows of 'df'

In [18]:
df[0:3]

date,open,high,low,close,volume,dividend
2021-01-04,133.520004,133.610001,126.760002,129.410004,143301900,0.0
2021-01-05,128.889999,131.740005,128.429993,131.009995,97664900,0.0
2021-01-06,127.720001,131.050003,126.379997,126.599998,155088000,0.0


Slices 'df' by date range (inclusive of the last value) after converting the index to datetime

In [19]:
df.index = pd.to_datetime(df.index)
df["2021-01-02":"2021-01-11"]

date,open,high,low,close,volume,dividend
2021-01-04,133.520004,133.610001,126.760002,129.410004,143301900,0.0
2021-01-05,128.889999,131.740005,128.429993,131.009995,97664900,0.0
2021-01-06,127.720001,131.050003,126.379997,126.599998,155088000,0.0
2021-01-07,128.360001,131.630005,127.860001,130.919998,109578200,0.0
2021-01-08,132.429993,132.630005,130.229996,132.050003,105158200,0.0
2021-01-11,129.190002,130.169998,128.5,128.979996,100384500,0.0
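
The difference between positional and label-based slicing can be sketched on a small frame with a business-day index. The `sample` frame and its values below are made up for illustration, and `.loc` is used for the label slice:

```python
import pandas as pd

# Six business days starting on Monday 2021-01-04 (the last one is 2021-01-11).
idx = pd.bdate_range("2021-01-04", periods=6)
sample = pd.DataFrame({"close": [0, 1, 2, 3, 4, 5]}, index=idx)

# Positional slice: the stop position is excluded, as with Python lists.
by_position = sample[0:3]

# Label slice: both endpoints are included, and the start label
# does not have to be present in the index.
by_label = sample.loc["2021-01-02":"2021-01-11"]

print(len(by_position))  # 3
print(len(by_label))     # 6
```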


Displays the index (dates) of 'df'

In [20]:
dates = df.index

In [21]:
display(dates)

Accesses the first date in the index

In [22]:
dates[0]

Timestamp('2021-01-04 00:00:00')

Accesses the row corresponding to the first date in the index

In [23]:
df.loc[df.index[0]]

open        1.335200e+02
high        1.336100e+02
low         1.267600e+02
close       1.294100e+02
volume      1.433019e+08
dividend    0.000000e+00
Name: 2021-01-04 00:00:00, dtype: float64

Accesses the 'close' value for the first date in the index

In [24]:
df.loc[df.index[0], "close"]

np.float64(129.41000366210938)

Accesses the 'open' and 'close' values for the first date in the index

In [25]:
df.loc[df.index[0], ["open", "close"]]

open     133.520004
close    129.410004
Name: 2021-01-04 00:00:00, dtype: float64

Slices the first six rows and selects the 'open' and 'close' columns

In [26]:
df.loc[df.index[0:6], ["open", "close"]]

date,open,close
2021-01-04,133.520004,129.410004
2021-01-05,128.889999,131.009995
2021-01-06,127.720001,126.599998
2021-01-07,128.360001,130.919998
2021-01-08,132.429993,132.050003
2021-01-11,129.190002,128.979996


Slices 'df' by date range and selects the 'open' and 'close' columns

In [27]:
df.loc["2021-01-02":"2021-01-11", ["open", "close"]]

date,open,close
2021-01-04,133.520004,129.410004
2021-01-05,128.889999,131.009995
2021-01-06,127.720001,126.599998
2021-01-07,128.360001,130.919998
2021-01-08,132.429993,132.050003
2021-01-11,129.190002,128.979996


Accesses the fourth row of 'df' using integer location

In [28]:
df.iloc[3]

open        1.283600e+02
high        1.316300e+02
low         1.278600e+02
close       1.309200e+02
volume      1.095782e+08
dividend    0.000000e+00
Name: 2021-01-07 00:00:00, dtype: float64

Slices the fourth and fifth rows and the first two columns using integer location

In [29]:
df.iloc[3:5, 0:2]

date,open,high
2021-01-07,128.360001,131.630005
2021-01-08,132.429993,132.630005


Selects specific rows and columns by integer position

In [30]:
df.iloc[[1, 2, 4], [0, 2]]

date,open,low
2021-01-05,128.889999,128.429993
2021-01-06,127.720001,126.379997
2021-01-08,132.429993,130.229996


Slices rows explicitly using integer location

In [31]:
df.iloc[1:3, :]

date,open,high,low,close,volume,dividend
2021-01-05,128.889999,131.740005,128.429993,131.009995,97664900,0.0
2021-01-06,127.720001,131.050003,126.379997,126.599998,155088000,0.0


Slices columns explicitly using integer location

In [32]:
df.iloc[:, 1:3]

date,high,low
2021-01-04,133.610001,126.760002
2021-01-05,131.740005,128.429993
2021-01-06,131.050003,126.379997
2021-01-07,131.630005,127.860001
2021-01-08,132.630005,130.229996
...,...,...
2025-01-27,232.149994,223.979996
2025-01-28,240.190002,230.809998
2025-01-29,239.860001,234.009995
2025-01-30,240.789993,237.210007


Accesses a specific value using integer location

In [33]:
df.iloc[1, 1]

np.float64(131.74000549316406)

Accesses the same value using 'iat', the fast scalar accessor for integer positions

In [34]:
df.iat[1, 1]

np.float64(131.74000549316406)
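
The four accessors used so far pair up as label-based ('loc', 'at') and position-based ('iloc', 'iat'), with 'at' and 'iat' limited to single values. A minimal sketch on a made-up frame (`sample` and its numbers are hypothetical) showing that all four can reach the same cell:

```python
import pandas as pd

idx = pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"])
sample = pd.DataFrame(
    {"open": [10.0, 11.0, 12.0], "high": [10.5, 11.5, 12.5]}, index=idx
)

by_label = sample.loc[idx[1], "high"]   # label-based, general purpose
by_position = sample.iloc[1, 1]         # position-based, general purpose
fast_label = sample.at[idx[1], "high"]  # label-based, scalars only
fast_position = sample.iat[1, 1]        # position-based, scalars only

print(by_label, by_position, fast_label, fast_position)  # 11.5 11.5 11.5 11.5
```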

Uses boolean indexing to select data where 'close' is greater than the mean 'close' value

In [35]:
df.close > df.close.mean()

date
2021-01-04    False
2021-01-05    False
2021-01-06    False
2021-01-07    False
2021-01-08    False
              ...  
2025-01-27     True
2025-01-28     True
2025-01-29     True
2025-01-30     True
2025-01-31     True
Name: close, Length: 1025, dtype: bool

In [36]:
df[df.close > df.close.mean()]

date,open,high,low,close,volume,dividend
2021-12-07,169.080002,171.580002,168.339996,171.179993,120405400,0.0
2021-12-08,172.130005,175.960007,170.699997,175.080002,116998900,0.0
2021-12-09,174.910004,176.750000,173.919998,174.559998,108923700,0.0
2021-12-10,175.210007,179.630005,174.690002,179.449997,115402700,0.0
2021-12-13,181.119995,182.130005,175.529999,175.740005,153237000,0.0
...,...,...,...,...,...,...
2025-01-27,224.020004,232.149994,223.979996,229.860001,94863400,0.0
2025-01-28,230.850006,240.190002,230.809998,238.259995,75707600,0.0
2025-01-29,234.119995,239.860001,234.009995,239.360001,45486100,0.0
2025-01-30,238.669998,240.789993,237.210007,237.589996,55658300,0.0


Selects the first column where 'close' is greater than the mean 'close' value

In [37]:
df[df.close > df.close.mean()].iloc[:, 0]

date
2021-12-07    169.080002
2021-12-08    172.130005
2021-12-09    174.910004
2021-12-10    175.210007
2021-12-13    181.119995
                 ...    
2025-01-27    224.020004
2025-01-28    230.850006
2025-01-29    234.119995
2025-01-30    238.669998
2025-01-31    247.190002
Name: open, Length: 471, dtype: float64
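
A boolean mask is an ordinary Series, so it can be stored, counted, and reused. The sketch below, on made-up numbers (`sample` is hypothetical), also shows that filtering rows and picking a column can be done in one `.loc` call instead of chaining:

```python
import pandas as pd

sample = pd.DataFrame(
    {"open": [1.0, 2.0, 3.0, 4.0], "close": [1.0, 2.0, 3.0, 6.0]}
)

# The mean of 'close' is 3.0, so only the last row is strictly above it.
mask = sample["close"] > sample["close"].mean()
print(mask.sum())  # 1 (True counts as 1)

# Chained form: filter the rows, then take the first column by position.
chained = sample[mask].iloc[:, 0]

# One-step form: filter the rows and select the column by label.
one_step = sample.loc[mask, "open"]

print(chained.equals(one_step))  # True
```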

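Uses multiple conditions to filter the DataFrame. Each condition is wrapped in parentheses and combined with '&'. Note that 'df.close.mean() > 100' evaluates to a single True or False that applies to every row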

In [38]:
df.loc[
    (df.close > df.close.mean())
    & (df.close.mean() > 100)
    & (df.volume > df.volume.mean())
]

date,open,high,low,close,volume,dividend
2021-12-07,169.080002,171.580002,168.339996,171.179993,120405400,0.0
2021-12-08,172.130005,175.960007,170.699997,175.080002,116998900,0.0
2021-12-09,174.910004,176.750000,173.919998,174.559998,108923700,0.0
2021-12-10,175.210007,179.630005,174.690002,179.449997,115402700,0.0
2021-12-13,181.119995,182.130005,175.529999,175.740005,153237000,0.0
...,...,...,...,...,...,...
2024-12-20,248.039993,255.000000,245.690002,254.490005,147495300,0.0
2025-01-21,224.000000,224.419998,219.380005,222.639999,98070400,0.0
2025-01-27,224.020004,232.149994,223.979996,229.860001,94863400,0.0
2025-01-28,230.850006,240.190002,230.809998,238.259995,75707600,0.0
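
In Python, `&` and `|` bind more tightly than comparison operators, which is why each comparison needs its own parentheses (or its own variable) before the masks are combined. A small sketch with made-up data (`sample` and the mask names are hypothetical):

```python
import pandas as pd

sample = pd.DataFrame(
    {"close": [10.0, 20.0, 30.0, 40.0], "volume": [400, 100, 300, 200]}
)

# Naming each mask avoids operator-precedence mistakes.
high_close = sample["close"] > sample["close"].mean()     # mean is 25.0
high_volume = sample["volume"] > sample["volume"].mean()  # mean is 250.0

both = sample[high_close & high_volume]        # AND
either = sample[high_close | high_volume]      # OR
neither = sample[~(high_close | high_volume)]  # NOT

print(both.index.tolist())     # [2]
print(either.index.tolist())   # [0, 2, 3]
print(neither.index.tolist())  # [1]
```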


In [39]:
display(df)

date,open,high,low,close,volume,dividend
2021-01-04,133.520004,133.610001,126.760002,129.410004,143301900,0.0
2021-01-05,128.889999,131.740005,128.429993,131.009995,97664900,0.0
2021-01-06,127.720001,131.050003,126.379997,126.599998,155088000,0.0
2021-01-07,128.360001,131.630005,127.860001,130.919998,109578200,0.0
2021-01-08,132.429993,132.630005,130.229996,132.050003,105158200,0.0
...,...,...,...,...,...,...
2025-01-27,224.020004,232.149994,223.979996,229.860001,94863400,0.0
2025-01-28,230.850006,240.190002,230.809998,238.259995,75707600,0.0
2025-01-29,234.119995,239.860001,234.009995,239.360001,45486100,0.0
2025-01-30,238.669998,240.789993,237.210007,237.589996,55658300,0.0


Selects rows from the year 2023

In [40]:
df.loc["2023"]

date,open,high,low,close,volume,dividend
2023-01-03,130.279999,130.899994,124.169998,125.070000,112117500,0.0
2023-01-04,126.889999,128.660004,125.080002,126.360001,89113600,0.0
2023-01-05,127.129997,127.769997,124.760002,125.019997,80962700,0.0
2023-01-06,126.010002,130.289993,124.889999,129.619995,87754700,0.0
2023-01-09,130.470001,133.410004,129.889999,130.149994,70790800,0.0
...,...,...,...,...,...,...
2023-12-22,195.179993,195.410004,192.970001,193.600006,37122800,0.0
2023-12-26,193.610001,193.889999,192.830002,193.050003,28919300,0.0
2023-12-27,192.490005,193.500000,191.089996,193.149994,48087700,0.0
2023-12-28,194.139999,194.660004,193.169998,193.580002,34049900,0.0


Selects rows from July 2023

In [41]:
df.loc["2023-07"]

date,open,high,low,close,volume,dividend
2023-07-03,193.779999,193.880005,191.759995,192.460007,31458200,0.0
2023-07-05,191.570007,192.979996,190.619995,191.330002,46920300,0.0
2023-07-06,189.839996,192.020004,189.199997,191.809998,45094300,0.0
2023-07-07,191.410004,192.669998,190.240005,190.679993,46778000,0.0
2023-07-10,189.259995,189.990005,187.039993,188.610001,59922200,0.0
2023-07-11,189.160004,189.300003,186.600006,188.080002,46638100,0.0
2023-07-12,189.679993,191.699997,188.470001,189.770004,60750200,0.0
2023-07-13,190.5,191.190002,189.779999,190.539993,41342300,0.0
2023-07-14,190.229996,191.179993,189.630005,190.690002,41573900,0.0
2023-07-17,191.899994,194.320007,191.809998,193.990005,50520200,0.0
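
Selecting by a partial date string like this relies on the index being a DatetimeIndex: a string such as "2023-07" is treated as the whole month. A minimal sketch on a made-up daily index (`sample` is hypothetical and uses calendar days, not trading days):

```python
import pandas as pd

# Calendar days from late June to early August 2023 (36 days in total).
idx = pd.date_range("2023-06-28", "2023-08-02", freq="D")
sample = pd.DataFrame({"close": list(range(len(idx)))}, index=idx)

july = sample.loc["2023-07"]  # every row whose date falls in July 2023
year = sample.loc["2023"]     # every row whose date falls in 2023

print(len(july))         # 31
print(july.index.min())  # 2023-07-01 00:00:00
print(len(year))         # 36
```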


Accesses the 'close' value on 2023-07-12

In [42]:
df.at["2023-07-12", "close"]

np.float64(189.77000427246094)

Selects the top 5 rows with the highest 'volume'

In [43]:
df.nlargest(5, "volume")

date,open,high,low,close,volume,dividend
2024-09-20,229.970001,233.089996,227.619995,228.199997,318679900,0.0
2024-06-21,210.389999,211.889999,207.110001,207.490005,246421400,0.0
2024-06-12,207.369995,220.199997,206.899994,213.070007,198134300,0.0
2021-12-17,169.929993,173.470001,169.690002,171.139999,195432700,0.0
2021-03-19,119.900002,121.43,119.68,119.989998,185549500,0.0
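
When there are no ties in the ranking column, `nlargest` gives the same rows as sorting in descending order and taking the head. A sketch on made-up numbers (`sample` is hypothetical):

```python
import pandas as pd

sample = pd.DataFrame(
    {"close": [10.0, 11.0, 12.0, 13.0], "volume": [300, 900, 100, 500]}
)

top2 = sample.nlargest(2, "volume")

# Equivalent long form: sort descending, then keep the first two rows.
sorted_top2 = sample.sort_values("volume", ascending=False).head(2)

print(top2.index.tolist())       # [1, 3]
print(top2.equals(sorted_top2))  # True
```

`nsmallest` works the same way for the lowest values.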


Queries the DataFrame to select rows where 'close' is greater than 'open'

In [44]:
df.query("close > open")

date,open,high,low,close,volume,dividend
2021-01-05,128.889999,131.740005,128.429993,131.009995,97664900,0.0
2021-01-07,128.360001,131.630005,127.860001,130.919998,109578200,0.0
2021-01-12,128.500000,129.690002,126.860001,128.800003,91951100,0.0
2021-01-13,128.759995,131.449997,128.490005,130.889999,88636800,0.0
2021-01-19,127.779999,128.710007,126.940002,127.830002,90757300,0.0
...,...,...,...,...,...,...
2025-01-15,234.639999,238.960007,234.429993,237.869995,39832000,0.0
2025-01-22,219.789993,224.119995,219.789993,223.830002,64126500,0.0
2025-01-27,224.020004,232.149994,223.979996,229.860001,94863400,0.0
2025-01-28,230.850006,240.190002,230.809998,238.259995,75707600,0.0
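
A `query` string is another way to write a boolean mask, and local Python variables can be referenced inside it with an `@` prefix. A short sketch on made-up prices (`sample` and `threshold` are hypothetical names):

```python
import pandas as pd

sample = pd.DataFrame(
    {"open": [10.0, 11.0, 12.0], "close": [10.5, 10.8, 12.4]}
)

# The query string and the explicit mask select the same rows.
up_days = sample.query("close > open")
same = sample[sample["close"] > sample["open"]]
print(up_days.equals(same))  # True

# '@threshold' refers to the local variable defined here.
threshold = 11.0
above = sample.query("close > open and close > @threshold")

print(up_days.index.tolist())  # [0, 2]
print(above.index.tolist())    # [2]
```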


**Jason Strimpel** is the founder of <a href='https://pyquantnews.com/'>PyQuant News</a> and co-founder of <a href='https://www.tradeblotter.io/'>Trade Blotter</a>. His career in algorithmic trading spans 20+ years. He previously traded for a Chicago-based hedge fund, was a risk manager at JPMorgan, and managed production risk technology for an energy derivatives trading firm in London. In Singapore, he served as APAC CIO for an agricultural trading firm and built the data science team for a global metals trading firm. Jason holds degrees in Finance and Economics and a Master's in Quantitative Finance from the Illinois Institute of Technology. His career spans America, Europe, and Asia. He shares his expertise through the <a href='https://pyquantnews.com/subscribe-to-the-pyquant-newsletter/'>PyQuant Newsletter</a> and social media, and has taught algorithmic trading with Python to 1,000+ students in his popular course **<a href='https://gettingstartedwithpythonforquantfinance.com/'>Getting Started With Python for Quant Finance</a>**. All code is for educational purposes only. Nothing provided here is financial advice. Use at your own risk.