## Sentiment Analysis for Stock Market Tweets

- Collect a dataset
- Train a model 
- Predict the sentiment of unlabelled tweets

The problem is straightforward, but ***where do we get the data?***

<a href="https://www.kaggle.com/datasets?search=stock+market+tweet"><img src="kaggle-logo.png" width="150" style="float:left; margin:10px;"></a><a href="https://twitter.com/search?q=%24SQ&src=cashtag_click"><img src="twitter-logo.png" width="150" style="float:left; margin:10px;"></a><a href="https://stocktwits.com/symbol/AAPL"><img src="stocktwits-logo.png" width="250" style="float:left; margin:10px;"></a>

- ### The best choice seems to be ***Scraping STOCKTWITS!*** 
    - Ensure the data is labelled 
    - Approaches to scraping: APIs, Selenium, jsoup
    - GOAL: ***get 1,000,000 labelled tweets***  
    - <a href="https://api.stocktwits.com/developers/docs">Stocktwits API limits</a>
<br />
<br />
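
One way to enforce the "labelled" requirement is to keep only messages whose author tagged them with a sentiment. The sketch below assumes each message dict carries that tag at `entities.sentiment.basic`, with the values `Bullish` or `Bearish`, and a null `sentiment` when the post is untagged; check this against a live response before relying on it. `extract_label` and `labelled_only` are hypothetical helper names.

```python
def extract_label(message):
    """Return 'Bullish' or 'Bearish' for a tagged message, else None.

    Assumption: the label lives at message['entities']['sentiment']['basic'].
    """
    entities = message.get("entities") or {}
    sentiment = entities.get("sentiment") or {}
    label = sentiment.get("basic")
    return label if label in ("Bullish", "Bearish") else None


def labelled_only(messages):
    """Keep (body, label) pairs for messages that carry a sentiment tag."""
    pairs = []
    for message in messages:
        label = extract_label(message)
        if label is not None:
            pairs.append((message.get("body", ""), label))
    return pairs
```

Untagged messages are dropped rather than guessed at, so every row that reaches the training set has a user-supplied label.
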
- ### ***HOW TO COMBAT LIMIT ISSUES?***
    - Proxy services such as <a href="https://zenscrape.com/#pricingSection">Zenscrape</a>
    - A VPN on a private computer, e.g. Private Internet Access
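
A rough sketch of how both ideas could look in code: retry with a growing delay when the API reports a rate limit, and optionally route the request through a proxy. It assumes rate-limited responses come back as HTTP 429 and that the proxy is given in the dictionary form `requests` accepts. `fetch_with_backoff` is a hypothetical helper, not part of the original scraper.

```python
import time


def fetch_with_backoff(url, get=None, proxies=None, max_tries=5,
                       base_delay=1.0, sleep=time.sleep):
    """GET `url`, waiting and retrying while the server answers HTTP 429.

    `get` defaults to requests.get; it is injectable so the retry logic can
    be exercised without network access. `proxies` is passed through as is,
    e.g. {"https": "http://user:pass@host:port"} (placeholder values).
    """
    if get is None:
        import requests  # imported lazily; only needed for real requests
        get = requests.get

    for attempt in range(max_tries):
        response = get(url, proxies=proxies, timeout=10)
        if response.status_code != 429:  # assumption: 429 signals the rate limit
            return response
        if attempt < max_tries - 1:
            # Wait 1s, 2s, 4s, ... (times base_delay) before trying again.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still rate limited after {max_tries} attempts: {url}")
```

Backing off only slows the scraper down to the allowed rate; a proxy or VPN changes the address the limit is counted against, which is why the two are complementary.
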


In [5]:
# Multithreaded StockTwitsScraper
# Team: Stock Market Sentiment Analysis
# Danish Siddiqui & Stepthen Speer

import requests
import json
from json import JSONDecodeError
import re
import time

In [6]:
# sample request with UPS
r = requests.get('https://api.stocktwits.com/api/2/streams/symbol/' + "UPS" + ".json")
dictionary = r.json()                     # decoded JSON response as a dict

maxi = dictionary['cursor']['max']        # pagination cursor returned with this page
length = len(dictionary['messages'])

print("Number of tweets on page: "+str(length))

# print the first five messages (fewer if the page is shorter)
for i, message in enumerate(dictionary['messages'][:5]):
    print(i, message['body'])


Number of tweets on page: 30
0 $UPS The Logistics of Disaster Response https://www.otcdynamics.com/ups-the-logistics-of-disaster-response
1 Boeing loses more 737 MAX orders, eyes jet&#39;s U.S. return but Europe tariffs loom  $BA $UPS $AAL 

https://newsfilter.io/a/bb9fcd4b5c2150a923e6d6c8b52bc35a
2 $UPS $FDX https://finance.yahoo.com/news/dhl-fedex-ups-ready-save-100806840.html

GOING TO BE THE BEST QUARTER PT$200
3 $UPS ups Going to be busiest ever this Christmas $200 🏄🏿‍♂️🏄🏿‍♂️🏄🏿‍♂️🏄🏿‍♂️🏄🏿‍♂️🏄🏿‍♂️🏄🏿‍♂
4 U.S. Justice Department clears Uber-Postmates deal

https://pageone.ng/2020/11/10/u-s-justice-department-clears-uber-postmates-deal/

$UBER $POSTMATES $LYFT $UPS $FDX
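
Each page holds about 30 messages, so reaching a large dataset means walking through many pages. A minimal sketch of that loop, assuming the `cursor['max']` value read above can be sent back as a `max` query parameter to get the next, older page. `scrape_symbol` and `fetch_page` are hypothetical names, and messages are de-duplicated by `id` in case consecutive pages overlap at the boundary.

```python
def scrape_symbol(symbol, fetch_page, max_pages=3):
    """Collect message dicts for `symbol`, one page at a time.

    `fetch_page(symbol, max_id)` returns the decoded JSON of one page;
    `max_id` is None for the newest page.
    """
    messages, seen, max_id = [], set(), None
    for _ in range(max_pages):
        page = fetch_page(symbol, max_id)
        batch = page.get("messages", [])
        if not batch:
            break
        for message in batch:
            if message["id"] not in seen:  # skip overlap between pages
                seen.add(message["id"])
                messages.append(message)
        max_id = page["cursor"]["max"]
    return messages


def fetch_page(symbol, max_id=None):
    """Fetch one page over HTTP (needs network access and `requests`)."""
    import requests
    url = "https://api.stocktwits.com/api/2/streams/symbol/" + symbol + ".json"
    # Assumption: the endpoint accepts `max` to page backwards in time.
    params = {"max": max_id} if max_id is not None else {}
    return requests.get(url, params=params, timeout=10).json()
```

Calling `scrape_symbol("UPS", fetch_page, max_pages=10)` would then gather up to ten pages for one ticker.
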


### The data is dirty, so clean it up: 
***No emojis, no links, no special characters, etc.*** 

In [7]:
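def clean_tweet(tweet):
    
    # Strips @mentions, links and every character that is not an
    # ASCII letter, digit, space or tab (this also removes emojis),
    # then collapses runs of whitespace into single spaces.
        
    tweet = tweet.replace('\n',' ').replace('\r',' ') 
    return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", tweet).split())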
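
The sample output above contains the HTML entity `&#39;`. Passed through `clean_tweet` as is, it leaves a stray `39` token in the text. One possible fix, sketched below, is to decode entities with the standard library's `html.unescape` before cleaning. `clean_tweet_unescaped` is a hypothetical name, and `clean_tweet` is repeated here only so the block runs on its own.

```python
import html
import re

_PATTERN = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")


def clean_tweet(tweet):
    """Same cleaning rule as the cell above."""
    tweet = tweet.replace("\n", " ").replace("\r", " ")
    return " ".join(_PATTERN.sub(" ", tweet).split())


def clean_tweet_unescaped(tweet):
    """Decode HTML entities such as &#39; first, then clean."""
    return clean_tweet(html.unescape(tweet))


sample = "jet&#39;s U.S. return $BA https://newsfilter.io/a/bb9"
print(clean_tweet(sample))            # jet 39 s U S return BA
print(clean_tweet_unescaped(sample))  # jet s U S return BA
```

Note that the cleaning rule also splits abbreviations such as `U.S.` and drops the `$` from cashtags, which may or may not be desirable for the model.
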

- ### ***Which Stocks to Scrape?***
- Scrape the top 2,000 US stocks by market cap
- Found on Finviz: <a href="https://finviz.com/screener.ashx?v=111&f=geo_usa&o=-marketcap&r=21">Screener</a>
- <a href="http://atlaiser.com/phpmyadmin/index.php">Most tweets are found in the top stocks</a>

- ### ***How many tweets to scrape?***
- Scrape a number of tweets equal to 10% of the watcher count

<b><p>SELECT SUM(watchers) from tickers</p></b>
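
One reading of the 10% rule is a per-ticker quota proportional to its watcher count. The sketch below works under that assumption, using an in-memory SQLite database as a stand-in for the project's database; the `symbol` column name, the example rows and the `tweet_quotas` helper are all made up for illustration.

```python
import sqlite3


def tweet_quotas(conn, percent=10):
    """Return {symbol: number of tweets to scrape} as `percent` of watchers."""
    rows = conn.execute("SELECT symbol, watchers FROM tickers").fetchall()
    # Integer arithmetic avoids floating-point rounding surprises.
    return {symbol: watchers * percent // 100 for symbol, watchers in rows}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickers (symbol TEXT PRIMARY KEY, watchers INTEGER)")
conn.executemany(
    "INSERT INTO tickers VALUES (?, ?)",
    [("AAA", 1000), ("BBB", 250), ("CCC", 40)],  # invented example rows
)

total_watchers = conn.execute("SELECT SUM(watchers) FROM tickers").fetchone()[0]
quotas = tweet_quotas(conn)
print(total_watchers, quotas)
```

Summing the quotas gives the overall scraping target, which is what the `SUM(watchers)` query is used to estimate.
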

- #### ***Goal of 1.55 Million Tweets was achieved***
- Interesting insight: only about **20%** of tweets are **bearish**

<b><p>SELECT COUNT(*) from tweets</p></b>
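
The bullish/bearish split can be checked with a grouped count. This is a sketch against an in-memory SQLite table; the `sentiment` column name and the `label_shares` helper are assumptions, since the original schema is not shown.

```python
import sqlite3


def label_shares(conn):
    """Return {label: fraction of all rows} for the tweets table."""
    rows = conn.execute(
        "SELECT sentiment, COUNT(*) FROM tweets GROUP BY sentiment"
    ).fetchall()
    total = sum(count for _, count in rows)
    return {label: count / total for label, count in rows}
```

With roughly 20% bearish tweets, a model that always answers "bullish" would already score about 80% accuracy, so that figure is the baseline any trained model has to beat.
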

<h2>Optimizing the Scraper</h2>
<h2 style="margin:0px;">10,000 tweets/hr -> 20,000 tweets/hr -> 100,000 tweets/hr</h2>
<h6 style="margin:0px;">Sequential -> Multithreaded -> Multithreaded Database Insertion</h6>
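
A simplified sketch of the last two stages, not the project's actual code: network fetches run on a thread pool, and each ticker's rows are written in a single batched insert. Unlike the multithreaded insertion named above, this version keeps all database writes on the calling thread, because a `sqlite3` connection is not shared across threads by default. The table layout and the `scrape_all` and `fetch` names are assumptions.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def scrape_all(symbols, fetch, conn, workers=8):
    """Fetch every symbol concurrently and store the results.

    `fetch(symbol)` returns a list of (symbol, body, sentiment) rows.
    Returns the number of rows inserted.
    """
    inserted = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map runs the fetches in parallel and yields results in order.
        for rows in pool.map(fetch, symbols):
            conn.executemany(
                "INSERT INTO tweets (symbol, body, sentiment) VALUES (?, ?, ?)",
                rows,
            )
            conn.commit()  # one commit per batch instead of one per row
            inserted += len(rows)
    return inserted
```

Threads help here because the scraper spends most of its time waiting on HTTP responses, and batching the inserts removes the per-row commit overhead.
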

Final Code run-through