# Stock Classification

## Summary of Notebook

**Project Background**: The Global Industry Classification Standard (GICS) is a widely used classification standard, adopted by thousands of market participants across all major groups involved in the investment process. Each stock in the S&P 500 belongs to a GICS sector such as Information Technology, Financials, Industrials, or Health Care. Each sector is further divided into finer levels, down to sub-industries.

**Project Goal**: Given several daily time series for a stock (Open, Close, High, Low, Adjusted Price, and Volume), predict that stock's GICS sector.

**Methods**: The dataset contains daily time series for $358$ stocks obtained from Yahoo Finance (after filtering from the roughly $500$ S&P 500 constituents), covering the start of $2016$ through the end of $2018$. The GICS sector obtained from Wikipedia is the label to be predicted. Most features were generated with the tsfresh library, together with the Hurst exponent. Exploratory data analysis is carried out on the useful features. The processed data is then used to train an XGBoost model that predicts the GICS sector.
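
The Hurst exponent $H$ summarizes whether a series tends to trend ($H > 0.5$), mean-revert ($H < 0.5$), or behave like a random walk ($H \approx 0.5$). The notebook does not show which estimator is used, so the following is only a minimal sketch of one common approach, assuming the standard deviation of lagged differences scales as $\text{lag}^H$. The function name `hurst_exponent` and the `max_lag` default are illustrative and not taken from the project code.

```python
import numpy as np


def hurst_exponent(series, max_lag=20):
    """Estimate the Hurst exponent from the scaling of lagged differences.

    Assumption: std(x[t + lag] - x[t]) grows like lag ** H, so H is the
    slope of a log-log fit. This is one common estimator, not necessarily
    the one used for the features in this project.
    """
    x = np.asarray(series, dtype=float)
    lags = np.arange(2, max_lag)
    # Spread of the differences at each lag
    tau = np.array([np.std(x[lag:] - x[:-lag]) for lag in lags])
    # The slope of log(tau) against log(lag) is the estimate of H
    slope, _intercept = np.polyfit(np.log(lags), np.log(tau), 1)
    return slope


rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.standard_normal(5000))  # expected H near 0.5
white_noise = rng.standard_normal(5000)             # expected H near 0

h_walk = hurst_exponent(random_walk)
h_noise = hurst_exponent(white_noise)
print(f"random walk: {h_walk:.2f}, white noise: {h_noise:.2f}")
```

Applied to a price series, a single number like this could serve as one feature per stock alongside the tsfresh features.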

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import csv
import sys
import datetime as dt

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
DIRECTORY_DATA = '../data/external/'

DIRECTORY_TICKERS = '../data/inter/tickers/'

In [4]:
START_YEAR = 2016
START_MONTH = 1
START_DAY = 1

END_YEAR = 2019
END_MONTH = 1
END_DAY = 1

START_DATE = dt.datetime(START_YEAR,START_MONTH,START_DAY)
END_DATE = dt.datetime(END_YEAR,END_MONTH,END_DAY)

SP500_WIKIPEDIA_LINK = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
TICKERS_UNFILTERED_FILENAME = 'tickers_unfiltered.pkl'

In [5]:
if '../src' not in sys.path:
    sys.path.append('../src')
    
import utils

Read and print the S&P 500 stock tickers from Wikipedia.

In [6]:
tickers = pd.read_html(SP500_WIKIPEDIA_LINK)[0]
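
`pd.read_html` returns a list with one DataFrame per HTML table found on the page, so indexing with `[0]` relies on the constituents table being the first one. A slightly more defensive sketch picks the first table that has the expected columns. `pick_table` is a hypothetical helper, and the toy DataFrames stand in for the list that `pd.read_html` would return.

```python
import pandas as pd


def pick_table(tables, required_columns=("Symbol", "GICS Sector")):
    """Return the first DataFrame that contains all required columns."""
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    raise ValueError(f"No table with columns {required_columns}")


# Toy stand-ins for the list returned by pd.read_html(SP500_WIKIPEDIA_LINK)
changes = pd.DataFrame({"Date": ["2018-05-31"], "Added": ["ABMD"]})
constituents = pd.DataFrame(
    {"Symbol": ["MMM", "ABT"], "GICS Sector": ["Industrials", "Health Care"]}
)

selected = pick_table([changes, constituents])
print(selected["Symbol"].to_list())
```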

In [7]:
tickers.head()

|   | Symbol | Security | SEC filings | GICS Sector | GICS Sub-Industry | Headquarters Location | Date first added | CIK | Founded |
|---|---|---|---|---|---|---|---|---|---|
| 0 | MMM | 3M | reports | Industrials | Industrial Conglomerates | Saint Paul, Minnesota | 1976-08-09 | 66740 | 1902 |
| 1 | ABT | Abbott Laboratories | reports | Health Care | Health Care Equipment | North Chicago, Illinois | 1964-03-31 | 1800 | 1888 |
| 2 | ABBV | AbbVie | reports | Health Care | Pharmaceuticals | North Chicago, Illinois | 2012-12-31 | 1551152 | 2013 (1888) |
| 3 | ABMD | Abiomed | reports | Health Care | Health Care Equipment | Danvers, Massachusetts | 2018-05-31 | 815094 | 1981 |
| 4 | ACN | Accenture | reports | Information Technology | IT Consulting & Other Services | Dublin, Ireland | 2011-07-06 | 1467373 | 1989 |


In [8]:
tickers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 505 entries, 0 to 504
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Symbol                 505 non-null    object
 1   Security               505 non-null    object
 2   SEC filings            505 non-null    object
 3   GICS Sector            505 non-null    object
 4   GICS Sub-Industry      505 non-null    object
 5   Headquarters Location  505 non-null    object
 6   Date first added       457 non-null    object
 7   CIK                    505 non-null    int64 
 8   Founded                505 non-null    object
dtypes: int64(1), object(8)
memory usage: 35.6+ KB


Download the price data for each ticker; symbols that could not be downloaded are collected in `missing_stocks`.

In [9]:
stocks = tickers['Symbol'].to_list()
missing_stocks = utils.download_stocks(DIRECTORY_DATA,stocks,START_YEAR,END_YEAR)
missing_stocks_df = pd.DataFrame(data={'Symbol':missing_stocks})

In [10]:
missing_stocks_df

|   | Symbol |
|---|---|
| 0 | BRK.B |
| 1 | BF.B |
| 2 | CARR |
| 3 | CTVA |
| 4 | DOW |
| 5 | FOXA |
| 6 | FOX |
| 7 | OGN |
| 8 | OTIS |
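
Two of the missing symbols, BRK.B and BF.B, contain a dot. Yahoo Finance writes share classes with a hyphen (for example `BRK-B`), which may be why these two failed to download. The remaining symbols appear to belong to companies that began trading under those tickers after 2018, so they would have no data in the requested window. Below is a minimal sketch of a symbol conversion, assuming the download helper passes symbols to Yahoo Finance unchanged. The helper `to_yahoo_symbol` is hypothetical and not part of `utils`.

```python
def to_yahoo_symbol(symbol):
    """Convert a Wikipedia-style ticker (BRK.B) to Yahoo style (BRK-B)."""
    return symbol.replace(".", "-")


missing = ["BRK.B", "BF.B", "CARR"]
converted = [to_yahoo_symbol(s) for s in missing]
print(converted)
```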


Remove the tickers that could not be downloaded.

In [11]:
# Anti-join: flag each row's origin, then keep tickers absent from missing_stocks_df
current_tickers = pd.merge(tickers, missing_stocks_df, on='Symbol', how='outer', indicator=True)
current_tickers = current_tickers[current_tickers['_merge'] == 'left_only']
# Drop the helper column added by indicator=True
current_tickers = current_tickers.drop(columns='_merge').reset_index(drop=True)
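
The merge with `indicator=True` acts as an anti-join: it keeps only the rows of `tickers` whose symbol does not appear in `missing_stocks_df`. The same filtering can be sketched with a boolean mask, shown here on toy data. The small frame and list below are made up for illustration.

```python
import pandas as pd

tickers_toy = pd.DataFrame(
    {
        "Symbol": ["MMM", "BRK.B", "ABT", "DOW"],
        "GICS Sector": ["Industrials", "Financials", "Health Care", "Materials"],
    }
)
missing_toy = ["BRK.B", "DOW"]

# Keep rows whose Symbol is not in the missing list
kept = tickers_toy[~tickers_toy["Symbol"].isin(missing_toy)].reset_index(drop=True)
print(kept["Symbol"].to_list())
```

The mask version preserves the original row order and needs no helper column, at the cost of being less explicit about which side each row came from.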

In [12]:
current_tickers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 496 entries, 0 to 495
Data columns (total 9 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Symbol                 496 non-null    object
 1   Security               496 non-null    object
 2   SEC filings            496 non-null    object
 3   GICS Sector            496 non-null    object
 4   GICS Sub-Industry      496 non-null    object
 5   Headquarters Location  496 non-null    object
 6   Date first added       448 non-null    object
 7   CIK                    496 non-null    int64 
 8   Founded                496 non-null    object
dtypes: int64(1), object(8)
memory usage: 35.0+ KB


In [13]:
current_tickers.to_pickle(os.path.join(DIRECTORY_TICKERS,TICKERS_UNFILTERED_FILENAME))
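
`to_pickle` raises an error if the target directory does not exist, so it can help to create the directory first and to verify the file by reading it back. The following is a minimal sketch that uses a temporary directory in place of `DIRECTORY_TICKERS`; the toy frame is illustrative.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"Symbol": ["MMM", "ABT"], "CIK": [66740, 1800]})

with tempfile.TemporaryDirectory() as tmp_dir:
    target_dir = os.path.join(tmp_dir, "inter", "tickers")
    os.makedirs(target_dir, exist_ok=True)  # create missing parent folders
    path = os.path.join(target_dir, "tickers_unfiltered.pkl")

    toy.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.equals(toy))
```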