# Market Learning
*Leveraging machine learning to help you learn the market and invest your money*

### Objective
The Market Learning project is intended to provide an adaptive, fact-based introduction to the stock market for users with limited experience who are looking for help with personal investment decisions.

### Outline
- Data Collection
- Data Cleaning
- Visualization
- Prediction
- Conclusion

In [2]:
import requests
from bs4 import BeautifulSoup as bs
import numpy as np
import pandas as pd
import yfinance as yf

### Data Collection
Get a list of tickers for various securities, then get financial data from yfinance for those tickers.

In [18]:
# Initialize lists of stock/security info
acronyms = []
# security_type = []
# primary_exchange = []
# industry = []
# other_categories = []
# name = []

In [16]:
# Get a list of tickers for different types of securities: stocks, ETFs, mutual funds, etc.
# https://www.cboe.com/us/equities/market_statistics/listed_symbols/
page_stocks = requests.get('https://stockanalysis.com/stocks/')
page_stocks.status_code

200

In [50]:
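# Scrape the stockanalysis page for the stock ticker symbols
# Note: the '.svelte-1jtwn20' class is generated by the site's build, so this selector may break when the page changes
soup = bs(page_stocks.content, 'html.parser')
results = soup.select('.svelte-1jtwn20 a')
for item in results:
    acronyms.append(item.get_text())
len(acronyms)  # , acronyms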

1000
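
Because `acronyms` is created in one cell and appended to in another, re-running the scraping cell adds the same symbols again. A minimal sketch of a cleanup step, assuming the scraped strings may carry stray whitespace or repeats; `clean_symbols` is a hypothetical helper, not part of the original notebook:

```python
def clean_symbols(raw_symbols):
    """Strip whitespace, drop empty strings, and remove repeats (first occurrence wins)."""
    stripped = (symbol.strip() for symbol in raw_symbols)
    # dict.fromkeys keeps insertion order, so the page order of the tickers is preserved
    return list(dict.fromkeys(symbol for symbol in stripped if symbol))


# Hypothetical input imitating a scraping cell that was run twice
scraped = ['AAPL', ' MSFT', 'AAPL', '', 'MSFT', 'GOOG\n']
cleaned = clean_symbols(scraped)
print(cleaned)
```

Applying this to `acronyms` before building the `yf.Tickers` string would keep each symbol from being requested more than once.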

In [49]:
ticker_objects = yf.Tickers(' '.join(acronyms))
# ticker_objects.tickers

In [53]:
df_aapl = ticker_objects.tickers['AAPL'].history(period='max', repair=False)
# df_aapl.info(), df_aapl.head()

In [71]:
ticker_objects.tickers['AAPL'].info

{'address1': 'One Apple Park Way',
 'city': 'Cupertino',
 'state': 'CA',
 'zip': '95014',
 'country': 'United States',
 'phone': '408 996 1010',
 'website': 'https://www.apple.com',
 'industry': 'Consumer Electronics',
 'sector': 'Technology',
 'longBusinessSummary': 'Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories worldwide. The company offers iPhone, a line of smartphones; Mac, a line of personal computers; iPad, a line of multi-purpose tablets; and wearables, home, and accessories comprising AirPods, Apple TV, Apple Watch, Beats products, and HomePod. It also provides AppleCare support and cloud services; and operates various platforms, including the App Store that allow customers to discover and download applications and digital content, such as books, music, video, games, and podcasts. In addition, the company offers various services, such as Apple Arcade, a game subscription service; Apple Fitness+, a personalized

In [54]:
# Mark each company's data with the security type, acronym/name, and other info
df_aapl['security_type'] = ticker_objects.tickers['AAPL'].info['quoteType']
df_aapl['acronym'] = ticker_objects.tickers['AAPL'].info['symbol']
# df_aapl['name'] = ticker_objects.tickers['AAPL'].info['shortName']
# df_aapl['primary_exchange'] = ticker_objects.tickers['AAPL'].info['exchange']
# df_aapl['industry'] = ticker_objects.tickers['AAPL'].info['industry']
# df_aapl['sector'] = ticker_objects.tickers['AAPL'].info['sector']

# df_aapl.info(), df_aapl.head()

In [48]:
# Storing every category as a string column takes up too much memory.
# Mapping the classifications directly to integer codes may work better.
# However, some encoding methods (like target encoding) require seeing the whole dataset first.
from enum import Enum
class SecurityType(Enum):
    EQUITY = 1
    ETF = 2
    MUTUALFUND = 3
    INDEX = 4
    CURRENCY = 5
    FUTURE = 6
    CRYPTO = 7


1
<class 'dict'>
<class 'str'> <class 'yfinance.ticker.Ticker'>
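
The memory concern in the comment above can be checked directly. A minimal sketch, assuming synthetic labels stand in for the real `quoteType` values, that compares a plain string column with pandas' `category` dtype, which stores each distinct label once plus a small integer code per row:

```python
import pandas as pd

# Synthetic stand-in for a 'Security Type' column (100,000 rows, 3 distinct labels)
labels = pd.Series(['EQUITY', 'ETF', 'EQUITY', 'MUTUALFUND'] * 25_000)

# Convert to a categorical column: one integer code per row, labels stored once
as_category = labels.astype('category')

object_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f'string column:      {object_bytes:,} bytes')
print(f'categorical column: {category_bytes:,} bytes')

# The integer codes are available if a purely numeric column is needed later
codes = as_category.cat.codes
print(codes.dtype, list(as_category.cat.categories))
```

Unlike target encoding, the categorical codes need no statistics from the full dataset, although the code assigned to each label depends on which labels are present when the column is converted.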


In [75]:
# Repeat this process for all tickers
# Collect one frame per ticker and concatenate once at the end,
# which avoids re-copying df_all on every iteration
frames = []
for symbol, ticker in ticker_objects.tickers.items():
    df = ticker.history(period='max', repair=False)
    if df.empty:
        continue  # no price history returned (see the "may be delisted" messages)
    df['Security Type'] = SecurityType[ticker.info['quoteType']].value
    df['Acronym'] = symbol # ticker.info['symbol']
#     df['Name'] = ticker.info['shortName']
#     df['Exchange'] = ticker.info['exchange']
#     df['Industry'] = ticker.info['industry']
#     df['Sector'] = ticker.info['sector']
    frames.append(df)
df_all = pd.concat(frames)

df_all.info(), df_all.head()

Got error from yahoo api for ticker AGM.A, Error: {'code': 'Not Found', 'description': 'No data found, symbol may be delisted'}
- AGM.A: No timezone found, symbol may be delisted
Got error from yahoo api for ticker AKO.A, Error: {'code': 'Not Found', 'description': 'No data found, symbol may be delisted'}
- AKO.A: No timezone found, symbol may be delisted
Got error from yahoo api for ticker AKO.B, Error: {'code': 'Not Found', 'description': 'No data found, symbol may be delisted'}
- AKO.B: No timezone found, symbol may be delisted
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1758903 entries, 1999-11-18 00:00:00-05:00 to 2023-04-17 00:00:00-04:00
Data columns (total 10 columns):
 #   Column         Dtype  
---  ------         -----  
 0   Open           float64
 1   High           float64
 2   Low            float64
 3   Close          float64
 4   Volume         float64
 5   Dividends      float64
 6   Stock Splits   float64
 7   Security Type  object 
 8   Acronym        objec

(None,
                                 Open       High        Low      Close  \
 Date                                                                    
 1999-11-18 00:00:00-05:00  27.761131  30.506740  24.405391  26.845930   
 1999-11-19 00:00:00-05:00  26.197658  26.235793  24.290988  24.634190   
 1999-11-22 00:00:00-05:00  25.206192  26.845930  24.443524  26.845930   
 1999-11-23 00:00:00-05:00  25.930721  26.617123  24.405384  24.405384   
 1999-11-24 00:00:00-05:00  24.481654  25.587524  24.405387  25.053656   
 
                                Volume  Dividends  Stock Splits Security Type  \
 Date                                                                           
 1999-11-18 00:00:00-05:00  62546380.0        0.0           0.0        EQUITY   
 1999-11-19 00:00:00-05:00  15234146.0        0.0           0.0        EQUITY   
 1999-11-22 00:00:00-05:00   6577870.0        0.0           0.0        EQUITY   
 1999-11-23 00:00:00-05:00   5975611.0        0.0           0.0     
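
The "symbol may be delisted" messages for AGM.A, AKO.A, and AKO.B may come from symbol formatting rather than actual delisting: Yahoo Finance generally writes share classes with a hyphen (for example `BRK-B`), while the scraped list uses a dot. A minimal sketch of a conversion step, assuming that convention holds for these tickers; `to_yahoo_symbol` is a hypothetical helper, not part of yfinance:

```python
def to_yahoo_symbol(symbol):
    """Convert a dotted share-class symbol (e.g. 'AGM.A') to the hyphenated form."""
    # Assumption: the dot marks a share class, not an exchange suffix
    return symbol.strip().upper().replace('.', '-')


# Symbols taken from the error messages above
failed = ['AGM.A', 'AKO.A', 'AKO.B']
retry_symbols = [to_yahoo_symbol(s) for s in failed]
print(retry_symbols)
```

The converted symbols could then be passed to `yf.Tickers(' '.join(retry_symbols))` to check whether price history comes back. The replacement is only appropriate for lists where a dot never denotes an exchange suffix.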