# Data Analysis

Now that we've got clean data, let's start with some basic financial analysis.

First, let's load our CSV file into a DataFrame, convert our dates, set and sort the index, and drop any duplicated rows.

In [None]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/ImperialCollegeLondon/efds-ta-python/refs/heads/main/data/AAPL_2024_clean.csv")
df["Date"] = pd.to_datetime(df["Date"])
df = df.set_index("Date").sort_index().drop_duplicates()
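
Before dropping anything, it can help to count what is there. The following is a minimal sketch on a small made-up frame (the name `sample` and its values are illustrative, not taken from the AAPL file) that counts duplicated rows and missing values with `duplicated()` and `isna()`:

```python
import pandas as pd

# Small made-up frame: one repeated row and one missing value
sample = pd.DataFrame(
    {
        "Date": ["2024-01-02", "2024-01-03", "2024-01-03", "2024-01-04"],
        "Adj Close": [185.40, 184.02, 184.02, None],
    }
)
sample["Date"] = pd.to_datetime(sample["Date"])
sample = sample.set_index("Date").sort_index()

# reset_index() so the date takes part in the duplicate check;
# duplicated() on its own compares column values and ignores the index
n_duplicates = int(sample.reset_index().duplicated().sum())

# Count missing values per column
n_missing = sample.isna().sum()

print(f"Duplicated rows: {n_duplicates}")
print(n_missing)
```

Because `drop_duplicates()` also ignores the index, two different trading days with identical column values would be treated as duplicates once the date has been moved into the index. Whether that matters depends on the data.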

## Returns

Returns refer to the gain or loss made on an initial investment, often expressed as a percentage. We can use the generic **percentage change** formula here:

$$ \frac{\text{price}_{\text{end}} - \text{price}_{\text{start}}}{\text{price}_{\text{start}}} $$

We can apply this to close prices to calculate the simple daily return:

$$ \frac{\text{close}_{\text{today}} - \text{close}_{\text{yesterday}}}{\text{close}_{\text{yesterday}}} $$

When available, use the Adjusted Close price, which takes into account corporate actions (dividends for example).

In [2]:
jan31_closing = df.loc["2024-01-31", "Adj Close"]
jan30_closing = df.loc["2024-01-30", "Adj Close"]

jan31_return = (jan31_closing - jan30_closing) / jan30_closing
print(f"Return on 31 Jan was {jan31_return:.2%}")

Return on 31 Jan was -1.94%


This simple daily return expresses a loss in value of 1.94% from one day to the next. Notice that we keep the return in decimal form and only convert it for display, using an f-string with the `:.2%` format specifier.

Repeating this calculation by hand for every day in our data set would take a long time. Let's see how pandas `pct_change()` makes this sort of work easy, by applying our percentage change formula to a whole column at once.

In [3]:
# Create a new column and populate it with daily returns
df['Daily Return'] = df['Adj Close'].pct_change()
df

Date,Open,High,Low,Close,Adj Close,Volume,Daily Return
2024-01-02,187.149994,188.440002,183.889999,185.639999,185.403412,82488700,
2024-01-03,184.220001,185.880005,183.429993,184.250000,184.015198,58414500,-0.007488
2024-01-04,182.149994,183.089996,180.880005,181.910004,181.678177,71983600,-0.012700
2024-01-05,181.990005,182.759995,180.169998,181.179993,180.949097,62303300,-0.004013
2024-01-08,182.089996,185.600006,181.500000,185.559998,185.323517,59144500,0.024175
...,...,...,...,...,...,...,...
2024-05-23,190.979996,191.000000,186.630005,186.880005,186.880005,51005900,-0.014424
2024-05-24,188.820007,189.979996,188.039993,189.979996,189.979996,36294600,0.016588
2024-05-28,191.509995,193.000000,189.100006,189.990005,189.990005,52280100,0.000053
2024-05-29,189.610001,192.250000,189.509995,190.289993,190.289993,53068000,0.001579
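
To see what `pct_change()` is doing, we can rebuild it by hand. This is a small sketch using the first three adjusted closes from the output above; `shift(1)` is used here to line up each price with the previous day's price, and the names `manual` and `built_in` are just for illustration:

```python
import pandas as pd

# First three adjusted closes from the output above
adj_close = pd.Series(
    [185.403412, 184.015198, 181.678177],
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)

# shift(1) moves each price down one row, lining up "yesterday" with "today"
previous = adj_close.shift(1)
manual = (adj_close - previous) / previous

# The built-in equivalent
built_in = adj_close.pct_change()

# The first entry is NaN in both: there is no previous day to compare with
print(manual.round(6).tolist())
print(built_in.round(6).tolist())
```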


Let's now calculate cumulative returns for the period. Instead of comparing a given day with the day before it, cumulative returns compare a given day with the first day of the period, to indicate how our stock has performed since our initial investment.

It is common to fill missing daily returns with 0, which indicates no change from the day before. The cell below leaves the first day's missing return as it is, so its cumulative return is shown as missing too.

Because we're compounding with a cumulative product, we add 1 to each daily return to turn it into a growth factor. We then subtract the 1 afterwards, so the result reads as a return again.

In [4]:
df["Cumulative Return"] = (1 + df["Daily Return"]).cumprod() - 1 

df

Date,Open,High,Low,Close,Adj Close,Volume,Daily Return,Cumulative Return
2024-01-02,187.149994,188.440002,183.889999,185.639999,185.403412,82488700,,
2024-01-03,184.220001,185.880005,183.429993,184.250000,184.015198,58414500,-0.007488,-0.007488
2024-01-04,182.149994,183.089996,180.880005,181.910004,181.678177,71983600,-0.012700,-0.020093
2024-01-05,181.990005,182.759995,180.169998,181.179993,180.949097,62303300,-0.004013,-0.024025
2024-01-08,182.089996,185.600006,181.500000,185.559998,185.323517,59144500,0.024175,-0.000431
...,...,...,...,...,...,...,...,...
2024-05-23,190.979996,191.000000,186.630005,186.880005,186.880005,51005900,-0.014424,0.007964
2024-05-24,188.820007,189.979996,188.039993,189.979996,189.979996,36294600,0.016588,0.024684
2024-05-28,191.509995,193.000000,189.100006,189.990005,189.990005,52280100,0.000053,0.024738
2024-05-29,189.610001,192.250000,189.509995,190.289993,190.289993,53068000,0.001579,0.026356
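
As a sanity check, compounding daily returns should give the same answer as comparing each price directly with the first price. Here is a minimal sketch using the first five adjusted closes from the output above. Unlike the cell above, it fills the first missing return with 0, so the series starts at 0 rather than `NaN`; the variable names are illustrative:

```python
import pandas as pd

# First five adjusted closes from the output above
adj_close = pd.Series(
    [185.403412, 184.015198, 181.678177, 180.949097, 185.323517],
    index=pd.to_datetime(
        ["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05", "2024-01-08"]
    ),
)

# Fill the first day's missing return with 0 (no change) so the
# cumulative return starts at 0 rather than NaN
daily = adj_close.pct_change().fillna(0)

# Turn returns into growth factors, compound them, then convert back
cumulative = (1 + daily).cumprod() - 1

# Cross-check: compare each price directly with the first price
direct = adj_close / adj_close.iloc[0] - 1

print(cumulative.round(6).tolist())
print(direct.round(6).tolist())
```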


### Exercise: Buy & Hold

#### Part 1

Imagine you had bought AAPL at the start of 2024. What would your return have been had you sold at the end of April, compared with holding until the end of May and selling then?

In [5]:
print(f'''
    Selling on 30 Apr {df["Cumulative Return"].loc["2024-04-30"]:.2%}
    versus selling on 30 May {df["Cumulative Return"].loc["2024-05-30"]:.2%}
''')


    Selling on 30 Apr -8.13%
    versus selling on 30 May 3.18%



#### Part 2

What if you had bought at the start of May, and sold at the end of the month?

In [6]:
print(
    f'Buy on 1 May, Sell on 30 May {(1 + df.loc["2024-05", "Daily Return"]).prod() - 1:.2%}'
)

Buy on 1 May, Sell on 30 May 12.31%
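
The same figure can be recovered from the two cumulative returns in Part 1. The product over May's daily returns includes the return for 1 May itself, which measures the move from the 30 Apr close, so strictly speaking the result is the return from the 30 Apr close to the 30 May close. A short sketch, using the rounded values printed in Part 1 (the variable names are illustrative):

```python
# Cumulative returns printed in Part 1 (rounded to two decimal places)
cum_30_apr = -0.0813
cum_30_may = 0.0318

# Divide the growth factors to isolate the growth between the two dates
may_return = (1 + cum_30_may) / (1 + cum_30_apr) - 1

print(f"30 Apr close to 30 May close: {may_return:.2%}")
```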


## Moving Averages

Moving averages are a different kind of indicator, one that smooths out small variations in trading data to give a better picture of the overall trend.

A Simple Moving Average (SMA) is the unweighted mean of a price over a fixed number of periods. The average is "moving" because each time a new day enters the window, the oldest day drops out.

Moving averages can be *fast*, when they cover a short period, or *slow*, when they cover a longer period. The longer the period, the more those small variations are smoothed out.

In [7]:
# Calculate a fast, 20-Day Moving Average
df['20-day MA'] = df['Adj Close'].rolling(window=20).mean()

# Calculate a slow, 200-Day Moving Average (NaN until 200 rows of data are available)
df['200-day MA'] = df['Adj Close'].rolling(window=200).mean()
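
By default, `rolling()` only produces a value once the window is full, which is why the early rows of a moving average are `NaN`. Here is a small sketch on made-up prices (`prices`, `sma` and `sma_partial` are illustrative names) showing the default behaviour alongside `min_periods=1`, which averages whatever data is available so far:

```python
import pandas as pd

# Made-up prices, just to show the window mechanics
prices = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# window=3: the first two rows have fewer than 3 observations, so they are NaN
sma = prices.rolling(window=3).mean()

# min_periods=1: average however many observations are available so far
sma_partial = prices.rolling(window=3, min_periods=1).mean()

print(pd.DataFrame({"price": prices, "sma": sma, "sma_partial": sma_partial}))
```

Whether partial windows are appropriate is a judgment call: they avoid gaps at the start of the series, but the early values are averaged over fewer days than the rest.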

## Surges

Surges in price or trading volume can be helpful indicators for traders. We can define a surge as an increase over the previous day that exceeds some set threshold; a common choice is two standard deviations above the mean. Since a price surge concerns the change in price between two trading days, we can reuse the daily returns we have already calculated, which provide that value for us!

In [8]:
# Find the mean return
mean_return = df["Daily Return"].mean()

# Define a threshold as two standard deviations above the mean
return_threshold = mean_return + (df["Daily Return"].std() * 2)

# Define a condition
condition = df["Daily Return"] > return_threshold

# Subset the dataframe where daily returns are higher than the threshold
df[condition]

Date,Open,High,Low,Close,Adj Close,Volume,Daily Return,Cumulative Return,20-day MA,200-day MA
2024-01-18,186.089996,189.139999,185.830002,187.119995,188.389618,78005800,0.032571,0.016107,,
2024-04-11,168.339996,175.460007,170.196207,175.039993,175.039993,91070300,0.043271,-0.055897,171.547499,
2024-05-02,172.509995,173.419998,170.889999,173.029999,176.340004,94214900,0.041583,-0.048885,169.98,
2024-05-03,186.649994,187.0,182.660004,183.380005,183.380005,163224100,0.039923,-0.010914,170.67,
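
The same idea can be applied to trading volume. The sketch below wraps the threshold logic in a hypothetical helper, `find_surges`, and runs it on made-up volume figures. Using the day-on-day percentage change in volume is an assumption here, mirroring what we did for price:

```python
import pandas as pd


def find_surges(changes: pd.Series, n_std: float = 2.0) -> pd.Series:
    """Return the entries of `changes` above mean + n_std standard deviations."""
    threshold = changes.mean() + n_std * changes.std()
    return changes[changes > threshold]


# Made-up volumes: steady trading with one busy day at the end
volume = pd.Series(
    [100, 101, 100, 101, 100, 101, 100, 101, 100, 150],
    index=pd.date_range("2024-01-02", periods=10, freq="B"),
)

# Day-on-day percentage change in volume (first entry is NaN)
volume_change = volume.pct_change()

# Only the final day's jump clears the two-standard-deviation threshold
volume_surges = find_surges(volume_change)

print(volume_surges)
```

Calling `find_surges(df["Daily Return"])` on the notebook's DataFrame would apply the same threshold as the price surge cell above.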


### Exercise: In the news

Can you find any events associated with the days you identified as price surges?