# **Project: Predicting Stock Prices using Algorithms**

### **Disclaimer** 
**This project was created solely for informational and educational purposes. Predicting stock prices is a highly complex task that involves<br> significant financial risk, and past performance is not indicative of future results. Please consult a qualified financial advisor before making any<br> investment decisions; I strongly discourage you from using my machine learning model as a basis for your financial decisions. <br>
<br>
Furthermore, all data used in this project was obtained from publicly available websites. <br>No information used in this project was obtained through unauthorized means.**

## **Phase Two: Testing Data Collection**
**Timescale: July, August 2023**

In [43]:
# importing libraries
import numpy as np
import pandas as pd

# yahoo finance
import yfinance as yf

import warnings
warnings.filterwarnings('ignore')

### **1. Stock Prices**
Source: Yahoo Finance, through Python library ***yfinance***<br><br>
**Data:**
- Open: price of MSFT when the stock market opens
- High: highest daily price of MSFT
- Low: lowest daily price of MSFT
- Close: price of MSFT when the stock market closes
- Adj Close: closing price of MSFT, adjusted for corporate actions such as dividends and stock splits.
    - **Better reflects the value of MSFT over time**
- Volume: total number of shares traded during the day

Since we are focusing on daily predictions, we only require the Date, Adj Close and Volume columns for MSFT.

In [46]:
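# `end` is exclusive, so the last trading day downloaded is 2023-08-30.
# auto_adjust=False keeps the separate 'Adj Close' column; newer yfinance releases default to auto_adjust=True.
df_prices = yf.download('MSFT', start='2023-06-30', end='2023-08-31', auto_adjust=False)

# keep only the columns we need and give them MSFT-specific names
df_prices = df_prices.reset_index()
df_prices = df_prices[['Date','Adj Close','Volume']].rename(columns={'Adj Close':'MSFT_Close','Volume':'MSFT_Volume'})
df_prices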
df_prices



Unnamed: 0,Date,MSFT_Close,MSFT_Volume
0,2023-06-30,339.820526,26823800
1,2023-07-03,337.275909,12508700
2,2023-07-05,337.435577,18172400
3,2023-07-06,340.548981,28161200
4,2023-07-07,336.507538,21185300
5,2023-07-10,331.128906,32791400
6,2023-07-11,331.767578,26698200
7,2023-07-12,336.487579,29995300
8,2023-07-13,341.936035,20567200
9,2023-07-14,344.51059,28302200


In [47]:
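# date imputing: expand the trading days to every calendar day in the range
all_days = pd.date_range('2023-06-30', '2023-08-31')
all_days = pd.DataFrame({"Date": all_days})

# the right join keeps every calendar day; non-trading days carry the last known values forward
df_prices = df_prices.merge(all_days, on='Date', how='right')
df_prices = df_prices.ffill()  # replaces the deprecated fillna(method='ffill')
# drop 2023-06-30, which was only needed to seed the forward fill
df_prices = df_prices.drop(index=0)
df_prices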

Unnamed: 0,Date,MSFT_Close,MSFT_Volume
1,2023-07-01,339.820526,26823800.0
2,2023-07-02,339.820526,26823800.0
3,2023-07-03,337.275909,12508700.0
4,2023-07-04,337.275909,12508700.0
5,2023-07-05,337.435577,18172400.0
...,...,...,...
58,2023-08-27,322.980011,21684100.0
59,2023-08-28,323.700012,14808500.0
60,2023-08-29,328.410004,19284600.0
61,2023-08-30,328.790009,15222100.0
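
The date-imputing step above can be reproduced on a toy frame to see exactly what happens on weekends and holidays. This is a minimal sketch, not the notebook's data: the prices below are made up, and the names `prices`, `calendar` and `daily` are hypothetical.

```python
import pandas as pd

# Hypothetical prices for three trading days around a weekend and a holiday.
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2023-06-30", "2023-07-03", "2023-07-05"]),
    "MSFT_Close": [100.0, 101.0, 102.0],  # made-up values
})

# One row per calendar day, inclusive of both endpoints.
calendar = pd.DataFrame({"Date": pd.date_range("2023-06-30", "2023-07-05")})

# Left-join onto the calendar, then carry the last known value forward.
daily = calendar.merge(prices, on="Date", how="left").ffill()
print(daily)
```

Each non-trading day (here 1, 2 and 4 July) inherits the most recent trading day's value, which is why the first row has to be a real trading day: a leading gap would have nothing to fill from.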


### **2. Additional Financial Data**
1. **Economic Indicators**
    - GDP Growth Rate (Quarterly)
    - Inflation Rate (CPI, Monthly)
2. **Market Indices**: NASDAQ
    - MSFT is listed on NASDAQ.
3. **Earnings Report**
    - Total Revenue
    - Basic EPS = Net Income / Weighted-Average Outstanding Shares
    - Outstanding Shares: shares held by all shareholders, including institutional investors and company insiders
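
The Basic EPS formula in the list above can be written as a small helper. This is a sketch with hypothetical figures, not Microsoft's reported numbers; the function name `basic_eps` and the optional `preferred_dividends` parameter are assumptions added for illustration (the standard definition subtracts preferred dividends from net income, which is zero for a company with no preferred stock).

```python
def basic_eps(net_income: float,
              weighted_avg_shares: float,
              preferred_dividends: float = 0.0) -> float:
    """Basic earnings per share: income available to common shareholders per share."""
    if weighted_avg_shares <= 0:
        raise ValueError("weighted_avg_shares must be positive")
    return (net_income - preferred_dividends) / weighted_avg_shares


# Hypothetical figures chosen for easy arithmetic.
eps = basic_eps(net_income=1_000_000.0, weighted_avg_shares=400_000.0)
print(eps)  # 2.5
```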

#### **Economic Indicators**
Sources:
- GDP Growth Rate (quarterly): https://data.oecd.org/gdp/quarterly-gdp.htm
- Inflation rate (CPI, monthly): https://data.oecd.org/price/inflation-cpi.htm

In [48]:
# Inflation Rate
df_inf = pd.read_csv('data/CPI_USA_Monthly.csv')[['TIME','Value']]
df_inf = df_inf[df_inf['TIME']<='2023-07'].rename(columns={'Value':'Inflation_Rate'})

# GDP
df_gdp = pd.read_csv('data/GDP_USA_Quarterly.csv')[['TIME','Value']].rename(columns={'Value':'GDP_Growth_Rate'})
df_gdp['TIME'] = pd.to_datetime(df_gdp['TIME']).dt.strftime('%Y-%m')

# Combined: quarterly GDP values are carried forward onto the monthly rows
df_econs = df_inf.merge(df_gdp, on='TIME', how='left')
df_econs = df_econs.ffill()  # replaces the deprecated fillna(method='ffill')
df_econs['TIME'] = pd.to_datetime(df_econs['TIME'])

# Date Imputing: monthly rows fall on the 1st, so forward-fill across the following days
df_econs = all_days.merge(df_econs, left_on='Date', right_on='TIME', how='left')
df_econs = df_econs.drop(index=0,columns='TIME').ffill()
df_econs

Unnamed: 0,Date,Inflation_Rate,GDP_Growth_Rate
1,2023-07-01,3.17778,0.496812
2,2023-07-02,3.17778,0.496812
3,2023-07-03,3.17778,0.496812
4,2023-07-04,3.17778,0.496812
5,2023-07-05,3.17778,0.496812
...,...,...,...
58,2023-08-27,3.17778,0.496812
59,2023-08-28,3.17778,0.496812
60,2023-08-29,3.17778,0.496812
61,2023-08-30,3.17778,0.496812
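
As an alternative to the merge-then-forward-fill pattern used in the cell above, `pd.merge_asof` attaches to each day the most recent earlier (or equal) observation in a single step. This is a sketch of a swapped-in technique, not the notebook's method; the inflation values and the names `monthly`, `days` and `daily` are hypothetical.

```python
import pandas as pd

# Hypothetical monthly observations stamped on the first day of each month.
monthly = pd.DataFrame({
    "Date": pd.date_range("2023-06-01", periods=2, freq="MS"),
    "Inflation_Rate": [3.0, 3.2],  # made-up values
})

# Calendar days spanning a month boundary.
days = pd.DataFrame({"Date": pd.date_range("2023-06-29", "2023-07-02")})

# Both frames must be sorted by the key; the default direction is 'backward',
# i.e. each day picks up the latest observation at or before it.
daily = pd.merge_asof(days, monthly, on="Date")
print(daily)
```

Unlike an exact-match merge, this does not depend on the first calendar day coinciding with an observation date, so a range that starts mid-month still receives values.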


#### **Market Indices**
- MSFT is listed on **NASDAQ**, so we use the **NASDAQ Composite Index** (ticker `^IXIC` on Yahoo Finance).
- All data in this section is obtained from Yahoo Finance, accessed through the Python library ***yfinance***.

In [49]:
# downloading data
nasdaq = yf.Ticker('^IXIC').history(period='max').reset_index()
nasdaq['Date'] = pd.to_datetime(nasdaq['Date'].dt.strftime('%Y-%m-%d'))
nasdaq = nasdaq[nasdaq['Date']>='2023-06-30']

# date imputing: align to calendar days and carry the last values forward over non-trading days
nasdaq = all_days.merge(nasdaq, on='Date', how='left')
nasdaq = nasdaq.ffill()  # replaces the deprecated fillna(method='ffill')
nasdaq = nasdaq.drop(index=0)
nasdaq = nasdaq[['Date','Close','Volume']].rename(columns={'Close':'NASDAQ_Close','Volume':'NASDAQ_Volume'})
nasdaq

Unnamed: 0,Date,NASDAQ_Close,NASDAQ_Volume
1,2023-07-01,13787.919922,4.661120e+09
2,2023-07-02,13787.919922,4.661120e+09
3,2023-07-03,13816.769531,2.902300e+09
4,2023-07-04,13816.769531,2.902300e+09
5,2023-07-05,13791.650391,5.339340e+09
...,...,...,...
58,2023-08-27,13590.650391,3.970060e+09
59,2023-08-28,13705.129883,3.666680e+09
60,2023-08-29,13943.759766,4.748180e+09
61,2023-08-30,14019.309570,4.364600e+09


#### **Earnings Report**
- Source: https://www.microsoft.com/en-us/investor/earnings/FY-2023-Q3/press-release-webcast

In [50]:
# Reshape the income statement: keep only the Quarter, Total revenue and Basic EPS rows
df_earnings = pd.read_excel('data/QuarterlyIncomeStatementsFY23_Modified.xlsx')
df_earnings = df_earnings.drop(columns=['Unnamed: 1','Unnamed: 2']).rename(columns={'Unnamed: 0':'Column'})
df_earnings = df_earnings[(df_earnings['Column']=='Quarter')|(df_earnings['Column']=='Total revenue')|(df_earnings['Column']=='Basic')]
df_earnings = pd.DataFrame({
    'Quarter':df_earnings.loc[0].values,
    'Revenue':df_earnings.loc[4].values,
    'EPS':df_earnings.loc[20].values
})
df_earnings = df_earnings.loc[21:31]

# Quarter to Date
quarter_to_month = {'Q1':'1','Q2':'4','Q3':'7','Q4':'10'}
df_earnings['Year'] = df_earnings['Quarter'].str[-2:]
df_earnings['Quarter'] = df_earnings['Quarter'].str[:2]
df_earnings['Month'] = df_earnings['Quarter'].map(quarter_to_month)
df_earnings['Quarter'] = pd.to_datetime(df_earnings['Year'] + df_earnings['Month'], format='%y%m')
df_earnings = df_earnings.drop(columns=['Year','Month']).rename(columns={'Quarter':'Date'})

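# Revenue is reported in millions of USD; convert to USD
df_earnings['Revenue'] = df_earnings['Revenue'] * 1e6

# EPS: drop the leading character (presumably a currency symbol) and convert to float
df_earnings['EPS'] = df_earnings['EPS'].str[1:].astype(float)

# date imputing: carry each quarter's figures forward to every calendar day
df_earnings = all_days.merge(df_earnings, on='Date', how='left')
df_earnings = df_earnings.ffill()  # replaces the deprecated fillna(method='ffill')
df_earnings = df_earnings.drop(index=0)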
df_earnings

Unnamed: 0,Date,Revenue,EPS
1,2023-07-01,5.285700e+10,2.46
2,2023-07-02,5.285700e+10,2.46
3,2023-07-03,5.285700e+10,2.46
4,2023-07-04,5.285700e+10,2.46
5,2023-07-05,5.285700e+10,2.46
...,...,...,...
58,2023-08-27,5.285700e+10,2.46
59,2023-08-28,5.285700e+10,2.46
60,2023-08-29,5.285700e+10,2.46
61,2023-08-30,5.285700e+10,2.46
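
The quarter-to-date conversion in the cell above can be isolated into a helper for testing. This sketch keeps the same quarter-to-month mapping and the same slicing (first two characters for the quarter, last two for the year), but zero-pads the month and adds a separator so the parse does not rely on how `%y%m` splits a three-digit string. The label format `"Q3 FY23"` and the function name `quarter_label_to_date` are assumptions; the actual labels in the spreadsheet may differ.

```python
import pandas as pd

# Same mapping as the notebook cell, with zero-padded months.
QUARTER_TO_MONTH = {"Q1": "01", "Q2": "04", "Q3": "07", "Q4": "10"}


def quarter_label_to_date(labels: pd.Series) -> pd.Series:
    """Convert labels such as 'Q3 FY23' (hypothetical format) to month-start timestamps."""
    quarter = labels.str[:2]   # e.g. 'Q3'
    year = labels.str[-2:]     # e.g. '23'
    month = quarter.map(QUARTER_TO_MONTH)
    return pd.to_datetime(year + "-" + month, format="%y-%m")


labels = pd.Series(["Q2 FY23", "Q3 FY23"])
dates = quarter_label_to_date(labels)
print(dates.dt.strftime("%Y-%m-%d").tolist())  # ['2023-04-01', '2023-07-01']
```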


In [51]:
# FCF is stored in millions of USD; convert to USD
df_cash = pd.read_excel('data/QuarterlyCashFlowStatementFY23_Modified.xlsx')
df_cash['FCF'] = df_cash['FCF'] * 1e6
df_cash = df_cash[df_cash['Date']=='2023-06-30']

# date imputing: carry the 2023-06-30 figure forward to every calendar day
df_cash = all_days.merge(df_cash, on='Date', how='left')
df_cash = df_cash.ffill()  # replaces the deprecated fillna(method='ffill')
df_cash = df_cash.drop(index=0)

# merging with earnings
df_earnings = df_cash.merge(df_earnings, left_on='Date', right_on='Date', how='left')
df_earnings

Unnamed: 0,Date,FCF,Revenue,EPS
0,2023-07-01,5.947500e+10,5.285700e+10,2.46
1,2023-07-02,5.947500e+10,5.285700e+10,2.46
2,2023-07-03,5.947500e+10,5.285700e+10,2.46
3,2023-07-04,5.947500e+10,5.285700e+10,2.46
4,2023-07-05,5.947500e+10,5.285700e+10,2.46
...,...,...,...,...
57,2023-08-27,5.947500e+10,5.285700e+10,2.46
58,2023-08-28,5.947500e+10,5.285700e+10,2.46
59,2023-08-29,5.947500e+10,5.285700e+10,2.46
60,2023-08-30,5.947500e+10,5.285700e+10,2.46


### **3. Combining All Information**

In [52]:
df = pd.merge(df_econs, nasdaq, on='Date', how='outer')
df = pd.merge(df, df_earnings, on='Date', how='outer')
df = pd.merge(df, df_prices, on='Date', how='outer')
df = df[['Date','GDP_Growth_Rate','Inflation_Rate','NASDAQ_Close','NASDAQ_Volume','Revenue','EPS','FCF','MSFT_Volume','MSFT_Close']]
df

Unnamed: 0,Date,GDP_Growth_Rate,Inflation_Rate,NASDAQ_Close,NASDAQ_Volume,Revenue,EPS,FCF,MSFT_Volume,MSFT_Close
0,2023-07-01,0.496812,3.17778,13787.919922,4.661120e+09,5.285700e+10,2.46,5.947500e+10,26823800.0,339.820526
1,2023-07-02,0.496812,3.17778,13787.919922,4.661120e+09,5.285700e+10,2.46,5.947500e+10,26823800.0,339.820526
2,2023-07-03,0.496812,3.17778,13816.769531,2.902300e+09,5.285700e+10,2.46,5.947500e+10,12508700.0,337.275909
3,2023-07-04,0.496812,3.17778,13816.769531,2.902300e+09,5.285700e+10,2.46,5.947500e+10,12508700.0,337.275909
4,2023-07-05,0.496812,3.17778,13791.650391,5.339340e+09,5.285700e+10,2.46,5.947500e+10,18172400.0,337.435577
...,...,...,...,...,...,...,...,...,...,...
57,2023-08-27,0.496812,3.17778,13590.650391,3.970060e+09,5.285700e+10,2.46,5.947500e+10,21684100.0,322.980011
58,2023-08-28,0.496812,3.17778,13705.129883,3.666680e+09,5.285700e+10,2.46,5.947500e+10,14808500.0,323.700012
59,2023-08-29,0.496812,3.17778,13943.759766,4.748180e+09,5.285700e+10,2.46,5.947500e+10,19284600.0,328.410004
60,2023-08-30,0.496812,3.17778,14019.309570,4.364600e+09,5.285700e+10,2.46,5.947500e+10,15222100.0,328.790009


In [53]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   Date             62 non-null     datetime64[ns]
 1   GDP_Growth_Rate  62 non-null     float64       
 2   Inflation_Rate   62 non-null     float64       
 3   NASDAQ_Close     62 non-null     float64       
 4   NASDAQ_Volume    62 non-null     float64       
 5   Revenue          62 non-null     float64       
 6   EPS              62 non-null     float64       
 7   FCF              62 non-null     float64       
 8   MSFT_Volume      62 non-null     float64       
 9   MSFT_Close       62 non-null     float64       
dtypes: datetime64[ns](1), float64(9)
memory usage: 5.0 KB


We have completed the data collection phase of the project.<br>
We will be saving this dataframe into a .csv file.

In [54]:
df.to_csv('data/MSFT_data_test.csv', index=False)
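
After saving, it can be worth reading the file back and checking that the daily index survived the round trip. The sketch below uses an in-memory buffer and a made-up three-row frame so that it runs on its own; `df_demo` and `restored` are hypothetical names, and the checks (unique dates, no missing values, one-day spacing) are one possible choice rather than a required set.

```python
import io

import pandas as pd

# Hypothetical stand-in for the combined dataframe.
df_demo = pd.DataFrame({
    "Date": pd.date_range("2023-07-01", periods=3),
    "MSFT_Close": [100.0, 101.0, 102.0],  # made-up values
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)

# Dates are stored as text in CSV, so parse them explicitly on the way back in.
restored = pd.read_csv(buffer, parse_dates=["Date"])

# Basic integrity checks for a daily dataset.
assert restored["Date"].is_unique
assert not restored.isna().any().any()
assert (restored["Date"].diff().dropna() == pd.Timedelta(days=1)).all()
print(restored.dtypes)
```

For the real file, the same checks would apply after `pd.read_csv('data/MSFT_data_test.csv', parse_dates=['Date'])`.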