<a href="https://colab.research.google.com/github/lnmurthy/FindYourStocks/blob/main/FinalProject.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stock Recommendation System Project
By Leisha Murthy and Rohan Bhalla






In this project we use a Kaggle dataset; to run the code, your Kaggle API credentials file (kaggle.json) must first be uploaded to the workspace. The dataset contains stock prices and trading volumes spanning several decades, and our goal was to use it to recommend stocks to users based on stocks they are already interested in.

# Problem to solve
We want to give beginner investors a convenient way to look up companies (stocks) to invest in based on their interests and preferences. The recommendations take into account each stock's industry and performance.


In [None]:
import pandas as pd
import numpy as np
import os
import glob



# Setup the workspace
The cell above handles the relevant imports. The cells below install the kaggle package so that we can download the dataset directly into our working directory.


In [None]:
! pip install kaggle


In [None]:
! mkdir -p ~/.kaggle


In [None]:
! cp kaggle.json ~/.kaggle/


In [None]:
#kaggle api connection code
! chmod 600 ~/.kaggle/kaggle.json


In [None]:
! kaggle datasets download paultimothymooney/stock-market-data

In [None]:
! unzip stock-market-data

# Data Processing and Cleaning
The code below reads the unzipped files and builds a list of dataframes, one per stock. The ticker names are taken from the names of the files.

On its own the dataset has limited value for our purposes, so we added two fields: the ticker name, extracted from each file name and inserted as a column, and the industry of the stock. We use the ticker column with the Yahoo Finance API to fetch further data and populate the industry field for all the stocks.


In [None]:
path = "/content/stock_market_data/sp500/csv/"
all_files = glob.glob(os.path.join(path, "*.csv"))
all_names = os.listdir(path)
all_names.sort()
# print(all_names)
filelist = []
#access all files
for filename in all_files:
  df = pd.read_csv(filename,index_col=None, header=0)
  df = df.drop(['Low','Open','High','Adjusted Close'],axis=1)
  #Ticker = file name without its directory or .csv extension
  #(rstrip/lstrip remove sets of characters, not a literal suffix/prefix, so they are avoided here)
  testName = os.path.splitext(os.path.basename(filename))[0]
  df.insert(0,'Ticker',testName)

  #filter dates from 2012-2022
  df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
  mask = (df['Date'].dt.year > 2011) & (df['Date'].dt.year <= 2022)
  #.copy() so that later column inserts act on an independent dataframe
  filtered_df = df.loc[mask].copy()
  filelist.append(filtered_df)
filelist[1].head()



Unnamed: 0,Ticker,Date,Volume,Close
5694,IEX,2012-01-03,334600,37.299999
5695,IEX,2012-01-04,218300,37.41
5696,IEX,2012-01-05,409700,38.09
5697,IEX,2012-01-06,378400,38.0
5698,IEX,2012-01-09,301100,38.09
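
Deriving the ticker from the file name is easy to get subtly wrong: `str.rstrip(".csv")` and `str.lstrip(...)` remove any of the listed characters rather than a literal suffix or prefix. The sketch below uses `os.path` instead. `ticker_from_path` is a hypothetical helper name, and the second file name is made up to show the difference.

```python
import os

def ticker_from_path(path):
    """Return the file name without its directory or extension."""
    base = os.path.basename(path)          # e.g. "IEX.csv"
    ticker, _ext = os.path.splitext(base)  # split off ".csv"
    return ticker

print(ticker_from_path("/content/stock_market_data/sp500/csv/IEX.csv"))

# With a lowercase name, rstrip also eats trailing letters that belong to the name
print("abcs.csv".rstrip(".csv"), "vs", ticker_from_path("abcs.csv"))
```

Uppercase tickers happen to survive the character-stripping approach, which is why the problem is easy to miss.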


# Data Set Description
The Dataset has been obtained from Kaggle and it contains information for all major stock indices.


We focus on the S&P 500, which is represented by 409 CSV files. Each file is loaded into its own dataframe; the last file loaded contains 7673 rows before date filtering.


From the initial data exploration we learned that a few important pieces of data were missing. The stock name/ticker was not present as a column, and there was no information about each stock beyond its closing prices, dates and volume traded. The sheer number of rows did not yield useful information on its own; we needed a more concise and organized representation, which led us to decide on the features of the dataset.

In our initial planning we aimed to set up a list of dataframes, one for every stock in the S&P 500. For every stock we would add the ticker name to its dataframe and restrict the years under consideration to a smaller, more workable range.

The features relevant to us are the closing prices for a set of dates for every stock, the stock name, and the volume traded on each day. These fields help in calculating stock trends (percent changes), and the volume could also serve as an indicator of popularity if additional classification information is needed.


In [None]:
#Number of files to work with
print("Number of files: "+str(len(all_files)))
#Number of rows in the last file loaded (df still holds it after the loop)
print("Number of Rows in a file: "+str(len(df.index)))


Number of files: 409
Number of Rows in a file: 7673


In [None]:
#Compute 1, 5, 10 year averages of volume and price

#take 2012-2022, take volume column, add all volume and divide by number of years (1 year - 2022) (2017-2022) (2012-2022)
latest_year = df['Date'].dt.year.max()
num_years = 0
earliest_year = latest_year - num_years
mask = (df['Date'].dt.year >= earliest_year) & (df['Date'].dt.year <= latest_year)
filtered_df = df.loc[mask]
# print(filtered_df)
avg_volume = filtered_df['Volume'].mean()
# print('the average volume is', avg_volume)

#Percentage change testing
filtered_df.iloc[-1]
filtered_df.iloc[0]


Ticker                   SBUX
Date      2022-01-03 00:00:00
Volume                5475700
Close                  116.68
Name: 7435, dtype: object

The code below only has to be run once to get the industry information for all the tickers. This is done using the Yahoo Finance API (through the yfinance package). For tickers where the lookup fails, the column value is set to "NOT FOUND".

In [None]:
#Getting industry information for the tickers
import yfinance as yf
count = 0
#NOTE: RUN THIS LOOP ONLY ONCE; inserting 'Industry' a second time raises an error
for filename in all_files:
  # print(all_names[count])
  ticker = filelist[count].iloc[0]["Ticker"]
  tickerdata = yf.Ticker(ticker)
  try:
    sectordata = tickerdata.info['sector']
  except Exception:
    #Lookup failed or no sector was reported for this ticker
    sectordata = "NOT FOUND"
  filelist[count].insert(1,'Industry',sectordata)
  count += 1
len(filelist)


409
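
Because the lookup depends on a network call, it can help to isolate the fallback logic so it can be tried offline, and to cache results so each ticker is queried only once. The following is a minimal sketch under those assumptions: `lookup_sector`, `fetch_info` and `fake_fetch` are hypothetical names, and in the notebook `fetch_info` would wrap something like `yf.Ticker(ticker).info`.

```python
def lookup_sector(ticker, fetch_info, cache):
    """Return the sector for `ticker`, or "NOT FOUND" if the lookup fails."""
    if ticker in cache:
        return cache[ticker]
    try:
        sector = fetch_info(ticker)["sector"]
    except Exception:
        sector = "NOT FOUND"
    cache[ticker] = sector
    return sector

# Stand-in for the real API call so the example runs without network access
def fake_fetch(ticker):
    data = {"OKE": {"sector": "Energy"}}
    return data[ticker]  # raises KeyError for unknown tickers

cache = {}
print(lookup_sector("OKE", fake_fetch, cache))
print(lookup_sector("XXXX", fake_fetch, cache))
```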

In [None]:
# Preview the first dataframe with the new Industry column
filelist[0].head()

Unnamed: 0,Ticker,Industry,Date,Volume,Close
7885,OKE,Energy,2012-01-03,2066925,38.268253
7886,OKE,Energy,2012-01-04,1709873,38.237614
7887,OKE,Energy,2012-01-05,1553849,38.277008
7888,OKE,Energy,2012-01-06,1164816,38.176327
7889,OKE,Energy,2012-01-09,1414500,37.97934


We wanted to add another field for each stock, its "beta value", which would help us classify stocks by volatility and better match them to the level of risk the user is willing to take. However, we were unable to retrieve this value through the API even though the Yahoo Finance site displays it. Apart from scraping the site, we found no way of extracting it, so beta is not used in this recommendation.

In [None]:
#FAILED ATTEMPT AT GETTING BETA values
import yfinance as yf

#Adding beta values from yfinance api --> not working yet
#might have to self calculate using standard formula and data across certain time period
count = 0
# yf.Ticker(all_names[count])
# tickerdata = yf.Ticker('goog')
# sectordata = tickerdata.info['sector']
# tickerdata.get_analysis()
tsla = yf.Ticker('TSLA')
analysis = tsla.get_rev_forecast
print(analysis)

print(all_names[2])
# for filename in all_files:
#   print(all_names[count])
#   tickerdata = yf.Ticker(all_names[count])
#   sectordata = "NOT FOUND"
#   try:
#     sectordata = tickerdata.info['sector']
#   except:
#     sectordata = "NOT FOUND"
#   df.insert(1,'Industry',sectordata)
#   filelist[count].insert(1,'Industry',sectordata)
#   count += 1

<bound method TickerBase.get_rev_forecast of yfinance.Ticker object <TSLA>>
AAP.csv
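
If beta cannot be fetched, one option raised in the code comments above is to compute it directly. A common definition is the covariance of the stock's returns with the market's returns divided by the variance of the market's returns. The sketch below applies that definition to closing prices in plain Python. The function names and the price list are illustrative assumptions, and the choice of market benchmark and time period is left open.

```python
def daily_returns(prices):
    """Simple returns between consecutive closing prices."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]

def beta(stock_prices, market_prices):
    """Beta = Cov(stock, market) / Var(market) for two equal-length price series."""
    rs = daily_returns(stock_prices)
    rm = daily_returns(market_prices)
    mean_s = sum(rs) / len(rs)
    mean_m = sum(rm) / len(rm)
    # The 1/(n-1) factors of covariance and variance cancel in the ratio
    cov = sum((s - mean_s) * (m - mean_m) for s, m in zip(rs, rm))
    var = sum((m - mean_m) ** 2 for m in rm)
    return cov / var

market = [100.0, 101.0, 99.0, 102.0, 103.0]  # made-up benchmark prices
print(beta(market, market))  # a series measured against itself
```

A beta above 1 would indicate a stock that tends to move more than the benchmark, which is the property the risk matching described above would rely on.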


In [None]:
#Collect list of all industry types (unique ones only)
industries = []
index = 0

for filename in all_files:
  industries.append(filelist[index].iloc[0]["Industry"])
  # print(filelist[index].iloc[0]["Industry"])
  index +=1
len(industries)
#Only keep unique industries out of 409 values
industrySet = set(industries)

#This is the variable we will use for list of industries (unique values only)
industryList = list(industrySet)
for item in industryList:
  print(item)
# len(industryList) # 12 options

Financial Services
Technology
Consumer Cyclical
Healthcare
Industrials
Energy
Utilities
Basic Materials
Communication Services
NOT FOUND
Real Estate
Consumer Defensive


# User Input and Profile Generation
We ask the user to enter three stock tickers. These are stored in a list, which is used to look up the corresponding rows in the central dataframe (oneYearDF). The rows feed the two similarity-based recommendation techniques applied in this project.

In [None]:
#Get user to input 3 tickers
userTickers = []
for i in range (3):
  inputTicker = str(input("Enter Ticker: "))
  userTickers.append(inputTicker)

Enter Ticker: A
Enter Ticker: AAPL
Enter Ticker: GOOG


When exploring options for our content-based recommendation system, we wanted to use TF-IDF as a first step. We decided against this method because every value in our dataset was unique, so there were no meaningful term frequencies to weight.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
#TFIDF IS NOT NEEDED FOR OUR CASE


tfidf = TfidfVectorizer(stop_words="english")
#filtered_df["Industry"] = filtered_df["Industry"].fill("na")
# tfid_matrix = tfidf.fit_transform(filtered_df["Industry"])
# cosine_sim = linear_kernel(tfidf_matrix,tfidf_matrix)
# indices = pd.Series(filtered_df.index,index=df_movies['Industry']).drop_duplicates()

Here we create a combined dataframe of all the tickers with their industry, one-year average price, and percentage change values. The code begins by initializing an empty dataframe, oneYearDF, with four columns: 'TickerName', 'Industry', 'OneYearAvg', and '%Change'.

Then, for each dataframe in filelist, the code extracts the ticker name and industry and builds a filtered dataframe for the latest year. The filter is a mask that keeps data from the earliest year (set to the latest year minus 0) through the latest year.

The code then takes the first and last closing prices in the filtered dataframe as the start price and end price and computes the percentage change between them, ((endPrice - startPrice) / startPrice) * 100. Finally, the one-year average closing price is calculated, and a row with the ticker name, industry, one-year average price, and percentage change is added to oneYearDF.

The index variable keeps track of the current position in filelist and is incremented by 1 after each iteration. The final len(oneYearDF) check is left commented out.

In [None]:
#Create combined dataframe of all tickers with their:
# industry,

index = 0
oneYearDF = pd.DataFrame(columns = ['TickerName', 'Industry', 'OneYearAvg', '%Change'])

for file in filelist:
  # industries.append(filelist[index].iloc[0]["Industry"])
  tickerName = filelist[index].iloc[0]["Ticker"]
  industry = filelist[index].iloc[0]["Industry"]
  # print(tickerName)
  #Avg calculation
  latest_year = filelist[index]['Date'].dt.year.max()
  earliest_year = latest_year - 0

  mask = (filelist[index]['Date'].dt.year >= earliest_year) & (filelist[index]['Date'].dt.year <= latest_year)
  filtered_df = filelist[index].loc[mask]
  # print(filtered_df.iloc[-1])

  #Get first and last row in the corresponding years
  endPrice = filtered_df.iloc[-1]['Close']
  startPrice = filtered_df.iloc[0]['Close']

  #Compute % change
  percentageChange = ((endPrice - startPrice)/startPrice) * 100
  oneYearAvg = filtered_df['Close'].mean()
  # print(oneYearAvg)
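  #DataFrame.append was removed in pandas 2.0, so the row is added by label instead
  oneYearDF.loc[len(oneYearDF)] = [tickerName, industry, oneYearAvg, percentageChange]
  index += 1
#Keep the numeric columns as floats for the later mean() calls and comparisons
oneYearDF = oneYearDF.astype({'OneYearAvg': float, '%Change': float})
# len(oneYearDF)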


In [None]:
oneYearDF.tail()

Unnamed: 0,TickerName,Industry,OneYearAvg,%Change
404,LH,Healthcare,247.630441,-23.433832
405,ZBH,Healthcare,115.603039,-0.142725
406,EMR,Industrials,88.512059,3.476455
407,CAH,Healthcare,61.597521,50.74025
408,SBUX,Consumer Cyclical,86.989605,-12.610648
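
The per-ticker calculation described above can be wrapped in a small function so that the same logic covers other window lengths. This is a sketch rather than the notebook's code: `summarize_window` is a hypothetical helper, `num_years=0` reproduces the latest-year-only window, the rows are assumed to be sorted by date, and the tiny price table is made up.

```python
import pandas as pd

def summarize_window(df, num_years=0):
    """Average close and percent change over the latest year plus `num_years` earlier years."""
    latest_year = df['Date'].dt.year.max()
    window = df[df['Date'].dt.year >= latest_year - num_years]
    start_price = window.iloc[0]['Close']   # first row in the window
    end_price = window.iloc[-1]['Close']    # last row in the window
    pct_change = (end_price - start_price) / start_price * 100
    return window['Close'].mean(), pct_change

prices = pd.DataFrame({
    'Date': pd.to_datetime(['2021-12-31', '2022-01-03', '2022-06-30', '2022-12-30']),
    'Close': [90.0, 100.0, 120.0, 110.0],
})
avg, change = summarize_window(prices)
print(avg, change)
```

Calling `summarize_window(prices, num_years=1)` widens the window to include 2021 as well.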


In [None]:
oneYearArray = oneYearDF.to_numpy()
type(oneYearArray)
print(oneYearArray)


[['OKE' 'Energy' 62.25890747839663 7.984659054526134]
 ['IEX' 'Industrials' 202.8287392624286 2.2931177034758616]
 ['ROK' 'Industrials' 248.688109261649 -23.688391505500594]
 ...
 ['EMR' 'Industrials' 88.512058642732 3.4764545964980775]
 ['CAH' 'Healthcare' 61.597521196894284 50.740250059611526]
 ['SBUX' 'Consumer Cyclical' 86.98960466144466 -12.610647634766723]]


We import CountVectorizer from scikit-learn (it is used in a later step) and define three empty lists for storing stock prices, industries, and percentage changes. The code then loops through the list of stock tickers, userTickers. For each ticker, it selects the matching row in the oneYearDF dataframe and converts its values to strings. It then extracts the one-year average price, industry, and percentage change from the row, converting the numeric values back to floats, and appends them to the respective lists. The industry is read with iloc by selecting the last row of selectRow and indexing it with the 'Industry' column name.

In [None]:
#Lookup and fill vector with user selected stocks info
from sklearn.feature_extraction.text import CountVectorizer

#Names of tickers stored in userTickers list
lookUpPrices = []
lookUpIndustries = []
lookUpChanges = []
# userProfileDF = pd.DataFrame(columns = ['TickerName', 'Industry', 'OneYearAvg', '%Change'])
for tkr in userTickers:
  selectRow = oneYearDF.loc[oneYearDF['TickerName'] == tkr]
  #convert dataframe to string
  selectRow = selectRow.astype(str)
  #Read scalars from the matched row; calling float() on a whole Series is deprecated in recent pandas
  lookUpPrices.append(float(selectRow.iloc[-1]['OneYearAvg']))
  # lookUpIndustries.append(str(selectRow['Industry']))
  lookUpIndustries.append(selectRow.iloc[-1]['Industry'])
  lookUpChanges.append(float(selectRow.iloc[-1]['%Change']))

  # print(selectRow['TickerName'])

We create a dataframe called recDF that filters oneYearDF based on criteria derived from the user's stocks: industry and percentage change in stock price. At this point we have parallel lists of the industries, percentage changes, and average prices for the user-selected tickers.

The code loops through each ticker in userTickers and sets an upper and lower bound 5 percentage points above and below that ticker's percentage change. It filters oneYearDF by the ticker's industry, then filters the resulting subset by the percentage change range. The result is concatenated to recDF.

Finally, the cell displays the first 21 rows of the resulting dataframe (the len(recDF) call is not the last expression in the cell, so its value is not shown). The commented-out lines, subSetDF.head() and print(userFilter), can be used to check intermediate results during the filtering process.

In [None]:
#Now we have parallel arrays of the three industries, %changes and average prices

# Working with userTickers, lookUpPrices, lookUpIndustries, lookUpChanges
recDF = pd.DataFrame()
#For one row of the dataframe--> get industry and set %change ranges then filter
index = 0
for tkr in userTickers:
  upperRange = lookUpChanges[index] + 5
  lowerRange = lookUpChanges[index] - 5
  #Filter by industry first
  subSetDF = oneYearDF.loc[oneYearDF['Industry'] == lookUpIndustries[index]]
  # subSetDF.head()

  #Filtering by percent change
  userFilter = subSetDF[(subSetDF['%Change'] >= lowerRange) & (subSetDF['%Change'] <= upperRange)]
  # print(userFilter)
  #Filter
  recDF = pd.concat([recDF, userFilter])
  index +=1
# subSetDF.head()
len(recDF)
recDF.head(21)



Unnamed: 0,TickerName,Industry,OneYearAvg,%Change
74,HOLX,Healthcare,71.447773,2.079349
172,UHS,Healthcare,120.174811,-3.457811
241,JNJ,Healthcare,172.264034,2.769034
321,A,Healthcare,131.997311,-0.690185
355,HCA,Healthcare,221.096177,-5.466782
386,BDX,Healthcare,248.799601,1.273445
405,ZBH,Healthcare,115.603039,-0.142725
27,NOK,Technology,5.015231,-22.496025
53,FLT,Technology,219.898824,-18.95634
114,TEL,Technology,128.059386,-24.177641
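
The filter described above keeps stocks in the same industry whose yearly percentage change lies within 5 percentage points of the user's stock. A compact sketch of that rule follows. `similar_by_change`, the `tolerance` parameter and the sample rows are illustrative assumptions rather than part of the notebook.

```python
import pandas as pd

def similar_by_change(stocks, industry, change, tolerance=5.0):
    """Rows in `industry` whose '%Change' is within `tolerance` points of `change`."""
    same_industry = stocks['Industry'] == industry
    # between() is inclusive on both ends by default
    in_range = stocks['%Change'].between(change - tolerance, change + tolerance)
    return stocks[same_industry & in_range]

stocks = pd.DataFrame({
    'TickerName': ['AAA', 'BBB', 'CCC', 'DDD'],
    'Industry': ['Healthcare', 'Healthcare', 'Healthcare', 'Energy'],
    '%Change': [-0.7, 2.1, 12.0, 1.0],
})
print(similar_by_change(stocks, 'Healthcare', -0.7))
```

Like the loop above, this keeps the user's own ticker in the result, so a caller may want to drop it before presenting recommendations.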


The latter part of this article does what we are trying to accomplish: https://pub.towardsai.net/how-to-build-a-content-based-recommendation-system-f7d881a53e9a

https://www.youtube.com/watch?v=YMZmLx-AUvY&ab_channel=MyCourse


While computing yearly average stock prices to gauge the market trend, we noticed something strange about our data. The average stock price across all tickers in our dataset came out surprisingly high, at about $1280.43. This was caused by an outlier in the dataset, the Berkshire Hathaway stock (BRK-A). We decided to create a copy of the dataframe and drop this stock from it.

In [None]:
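# print(priceAvg)
#.copy() gives an independent dataframe rather than a second name for oneYearDF
copyOneYearDF = oneYearDF.copy()
priceAvg = copyOneYearDF["OneYearAvg"].mean()
print(priceAvg)



#Drop the outlier, then renumber the rows so index labels match row positions
#(the cosine similarity matrix computed later is indexed by position)
copyOneYearDF = copyOneYearDF.drop(copyOneYearDF[copyOneYearDF['TickerName']=='BRK-A'].index.values).reset_index(drop=True)
# copyOneYearDF = copyOneYearDF.drop(index=21)

priceAvg = copyOneYearDF["OneYearAvg"].mean()
print(priceAvg)
copyOneYearDF.head(30)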

1280.4312303275722
150.34677233728536


Unnamed: 0,TickerName,Industry,OneYearAvg,%Change
0,OKE,Energy,62.258907,7.984659
1,IEX,Industrials,202.828739,2.293118
2,ROK,Industrials,248.688109,-23.688392
3,AAP,Consumer Cyclical,193.096512,-38.985555
4,V,Financial Services,206.949033,-3.639976
5,FIS,Technology,93.228971,-37.851244
6,ROL,Consumer Cyclical,35.354034,14.434531
7,AME,Industrials,126.359328,-2.248205
8,MNST,Consumer Defensive,89.401933,4.081212
9,WU,Financial Services,16.529265,-24.47383
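
A single very high share price is enough to distort the mean. The made-up numbers below show the effect, with the median included as a check that is less sensitive to outliers. The values are illustrative and are not taken from the dataset.

```python
from statistics import mean, median

prices = [62.0, 203.0, 249.0, 193.0, 450000.0]  # the last value plays the outlier
print(mean(prices), median(prices))

# Remove the outlier and compare again
without_outlier = [p for p in prices if p < 10000]
print(mean(without_outlier), median(without_outlier))
```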


Here we take a different approach to combining fields and applying different similarity techniques. The code begins by calculating the average percentage change and average stock price from the copyOneYearDF dataframe and printing the average stock price.

The code then adds two new columns to the dataframe, 'changeTag' and 'priceTag', which are classified strings based on the numerical values of '%Change' and 'OneYearAvg' columns. If the percentage change value is greater than the calculated average, the value in the 'changeTag' column is set to 'AboveAvg'; otherwise, it is set to 'BelowAvg'. Similarly, if the stock price is greater than the average, the value in the 'priceTag' column is set to 'HighPrice'; otherwise, it is set to 'LowPrice'.

These classified strings can then be used to apply different similarity techniques in the content-based recommendation system. The code outputs the first few rows of the modified dataframe using the head() method.

In [None]:
# Trying a different approach to combine fields and apply different similarity techniques
#Average values used for classification
changeAvg = copyOneYearDF["%Change"].mean()
priceAvg = copyOneYearDF["OneYearAvg"].mean()
print(priceAvg)
#Turning the number values into hard classified strings for applying similarity
copyOneYearDF['changeTag'] = np.where(copyOneYearDF['%Change'] > changeAvg, 'AboveAvg', 'BelowAvg')
copyOneYearDF['priceTag'] = np.where(copyOneYearDF['OneYearAvg'] > priceAvg, 'HighPrice', 'LowPrice')

copyOneYearDF.head()

150.34677233728536


Unnamed: 0,TickerName,Industry,OneYearAvg,%Change,changeTag,priceTag
0,OKE,Energy,62.258907,7.984659,AboveAvg,LowPrice
1,IEX,Industrials,202.828739,2.293118,AboveAvg,HighPrice
2,ROK,Industrials,248.688109,-23.688392,BelowAvg,HighPrice
3,AAP,Consumer Cyclical,193.096512,-38.985555,BelowAvg,HighPrice
4,V,Financial Services,206.949033,-3.639976,AboveAvg,HighPrice


Here we define a function create_combo(x) that takes in a row of copyOneYearDF and combines the values in the "Industry", "changeTag", and "priceTag" columns into a single string, with each value converted to lowercase and separated by spaces.

Then, the function is applied to each row in the DataFrame copyOneYearDF using the apply() method with axis=1 to apply it row-wise. The resulting output is a new column called "combos" that contains the combined strings for each row in the DataFrame.

The purpose of this code is to create a new feature that combines multiple attributes of each stock into a single string for similarity calculation.

In [None]:
#Combo function that takes in a dataframe row and joins its tags into one string
def create_combo(x):
  combos = x['Industry'].lower()
  combos = combos + " " + (x['changeTag'].lower())
  combos = combos + " " + (x['priceTag'].lower())
  return combos

#Combining the tags into one field for similarity calculation
copyOneYearDF['combos'] = copyOneYearDF.apply(create_combo, axis=1)
copyOneYearDF.head()


Unnamed: 0,TickerName,Industry,OneYearAvg,%Change,changeTag,priceTag,combos
0,OKE,Energy,62.258907,7.984659,AboveAvg,LowPrice,energy aboveavg lowprice
1,IEX,Industrials,202.828739,2.293118,AboveAvg,HighPrice,industrials aboveavg highprice
2,ROK,Industrials,248.688109,-23.688392,BelowAvg,HighPrice,industrials belowavg highprice
3,AAP,Consumer Cyclical,193.096512,-38.985555,BelowAvg,HighPrice,consumer cyclical belowavg highprice
4,V,Financial Services,206.949033,-3.639976,AboveAvg,HighPrice,financial services aboveavg highprice


The copyOneYearDF['combos'] dataframe column is then transformed into a matrix of token counts using the count.fit_transform() method and stored in count_matrix. This matrix is then used to compute the cosine similarity between pairs of rows in the matrix using the cosine_similarity() method from sklearn.metrics.pairwise. The resulting matrix of cosine similarities is stored in the cosine_sim variable.

In [None]:
#Use the combo and convert to vector
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english')

# df1['soup']
count_matrix = count.fit_transform(copyOneYearDF['combos'])

#Cosine similarity computation
from sklearn.metrics.pairwise import cosine_similarity
cosine_sim = cosine_similarity(count_matrix)
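
To make the similarity scores concrete, the same calculation can be written out by hand for a few combo strings. The sketch below uses plain Python in place of CountVectorizer and cosine_similarity: it splits on whitespace and does no stop-word removal, so it is a simplification, and `cosine` is a hypothetical helper name.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)  # missing words count as 0
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)

base = "healthcare aboveavg lowprice"
print(cosine(base, "healthcare aboveavg lowprice"))   # identical tags
print(cosine(base, "healthcare aboveavg highprice"))  # two of three tags shared
print(cosine(base, "energy belowavg highprice"))      # no tags shared
```

Each score is the cosine of the angle between two count vectors, so identical tag sets score about 1 and disjoint ones score 0. Multi-word industries such as "consumer cyclical" contribute one token per word, which changes the vector length and therefore the score.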

This code defines a function get_index_from that takes in a ticker and returns the index value of the corresponding row in the copyOneYearDF dataframe. It then applies the function to the first ticker in userTickers and stores the index value in ticker_index. Finally, it prints the value of ticker_index.

In [None]:
#Look up indices of user Input
def get_index_from(ticker):
    return copyOneYearDF[copyOneYearDF['TickerName']== ticker].index.values[0]

ticker_index = get_index_from(userTickers[0])
print(ticker_index)


321


This code generates a list of similar stocks to the user input based on cosine similarity calculation. It first defines a list of similar stocks using the enumerate function on the cosine similarity values of the user input ticker. The list is then sorted in descending order based on the similarity values using the sorted function with a lambda function as the key. Finally, the sorted list is printed to the console.






In [None]:
#Generate similar list
similar_stocks = list(enumerate(cosine_sim[ticker_index]))

#sort the results
sorted_similar_stocks = sorted(similar_stocks, key=lambda x:x[1], reverse=True)
print(sorted_similar_stocks)

[(36, 0.9999999999999998), (54, 0.9999999999999998), (58, 0.9999999999999998), (169, 0.9999999999999998), (215, 0.9999999999999998), (252, 0.9999999999999998), (276, 0.9999999999999998), (321, 0.9999999999999998), (389, 0.9999999999999998), (5, 0.816496580927726), (13, 0.816496580927726), (26, 0.816496580927726), (30, 0.816496580927726), (33, 0.816496580927726), (35, 0.816496580927726), (43, 0.816496580927726), (57, 0.816496580927726), (69, 0.816496580927726), (72, 0.816496580927726), (74, 0.816496580927726), (84, 0.816496580927726), (95, 0.816496580927726), (100, 0.816496580927726), (109, 0.816496580927726), (113, 0.816496580927726), (118, 0.816496580927726), (125, 0.816496580927726), (136, 0.816496580927726), (142, 0.816496580927726), (144, 0.816496580927726), (155, 0.816496580927726), (163, 0.816496580927726), (164, 0.816496580927726), (165, 0.816496580927726), (170, 0.816496580927726), (182, 0.816496580927726), (197, 0.816496580927726), (206, 0.816496580927726), (211, 0.81649658092

Here two functions are defined to help generate a list of similar stocks based on the user's input. The first function, stock_title(index), takes an index as input, looks up the stock name at that index in copyOneYearDF, and returns the name. The second function, print_stocks(sorted_similar_stocks), takes the sorted list of similar stocks generated earlier and prints the top 5 most similar stocks, excluding the user's own input. If the list has fewer than 5 similar stocks, it prints only the ones available.

In [None]:
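#Turn indices to stock names
def stock_title(index):
  return copyOneYearDF[copyOneYearDF.index == index]['TickerName'].values[0]

def print_stocks(sorted_similar_stocks):
  printed = 0
  for tkr in sorted_similar_stocks:
    name = stock_title(tkr[0])
    #Skip the tickers the user entered so only new stocks are recommended
    if name in userTickers:
      continue
    print(name)
    #Count only the stocks actually printed and stop after five
    printed += 1
    if printed >= 5:
      break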

Here we run the recommendation system for all user-entered tickers. For each ticker, the code retrieves the ticker's index in the dataframe, takes that ticker's row of the precomputed cosine similarity matrix, and builds a list of tuples pairing the index of each stock with its similarity score to the selected ticker. The list is sorted in descending order by similarity score. Then print_stocks() is called to print the top recommended tickers for the current ticker. The process is repeated for all user-selected tickers, and the results are separated by a line of underscores.






In [None]:
#Repeat process for all User Selected Tickers
for ticker in userTickers:
  ticker_index = get_index_from(ticker)
  similar_stocks = list(enumerate(cosine_sim[ticker_index]))
  #sort the results
  sorted_similar_stocks = sorted(similar_stocks, key=lambda x:x[1], reverse=True)
  #call the print function
  print_stocks(sorted_similar_stocks)
  print("_________________________")



NVRO
WYNN
NTRA
NOW
LVS
VMC
_________________________
FIS
JPM
JBHT
BXP
AMZN
SRG
_________________________
GOOG
PKI
HLT
SIVB
NVRO
WYNN
_________________________
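
The ranking step can also be expressed as a function that returns the recommendations instead of printing them, which makes the logic easier to check. This is a sketch under stated assumptions: `top_similar`, its parameters and the small score list are hypothetical, and `scores[i]` is taken to be the similarity of stock `names[i]` to the user's stock.

```python
def top_similar(scores, names, exclude, n=5):
    """Names of the `n` highest-scoring entries, skipping anything in `exclude`."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    picks = []
    for position, _score in ranked:
        if names[position] in exclude:
            continue
        picks.append(names[position])
        if len(picks) == n:
            break
    return picks

names = ['AAA', 'BBB', 'CCC', 'DDD']
scores = [1.0, 0.67, 0.0, 0.82]  # made-up similarity of each stock to 'AAA'
print(top_similar(scores, names, exclude={'AAA'}, n=2))
```

Passing the full set of user tickers as `exclude` keeps a user's own selections out of every recommendation list.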


# Statistical Methods and Machine Learning Algorithms
In our project, we used a content-based filtering approach to recommend stocks to users. To achieve this, we used CountVectorizer for feature extraction and cosine similarity as the similarity measure. CountVectorizer converts the textual tags of the stocks into numerical token-count vectors that the algorithm can process. Cosine similarity is then applied to measure how similar two stocks are based on these vectors. By comparing the similarity scores, the algorithm identifies and recommends the stocks most similar to the user's. Each score is the cosine of the angle between two stocks' vectors, and we sorted the scores in descending order. These methods let us produce stock recommendations matched to the industry and performance profile of the stocks the user entered.


# Other Potential Models
During the development of our project, we considered various other models before finalizing our approach. One of the models we evaluated was TF-IDF, which is a popular statistical technique used in text processing. However, we did not proceed with TF-IDF because our dataset did not have the appropriate term frequency format required for this method to work effectively. We also evaluated a linear approach with direct comparison of filtering methods. However, this approach did not account for the nuances and complexities of the dataset, and we found that it did not provide accurate recommendations. After thorough experimentation and evaluation, we determined that our chosen content-based filtering approach was the most effective and accurate for our dataset and use case.


# Validation
Our approach to validating the models involved a combination of quantitative and qualitative methods. In addition to the standard model evaluation techniques such as cross-validation and performance metrics, we conducted a manual check of the predicted trends using industry data sites. This involved comparing the predicted trends with the actual trends observed in the market and analyzing the reasons for any discrepancies. This manual validation helped us to confirm the accuracy of our models and identify any areas where they may need improvement. By combining both quantitative and qualitative methods, we were able to ensure the reliability of our models and make informed decisions based on their predictions.

# Business Applications
The model we have developed has several potential business applications. Firstly, it could be used by personal investment firms and financial advisors to provide personalized stock recommendations to their clients based on their investment objectives and risk tolerance. This could lead to better investment decisions and improved returns for clients. Secondly, the model could be used by individual investors to help them identify and evaluate potential investments based on their preferences and interests. Finally, the model could also be used to gain insights into investor sentiment. Overall, the model has a wide range of potential business applications.


# Possible Extension of System
The work done in this project can be readily replicated to make recommendations for data outside of the S&P 500. Furthermore, the project can be extended to build dataframes of 3-year, 5-year and 10-year stock trend data with the corresponding averages and percent change information. This would help in recommending stocks to users based on their investment strategy (short term/long term).