MNB And RFC Model Import Bullish Test Rev 1.2

This will test to see if we can import a classification model and apply it to a different data set. The original code was from a PyCaret example.

NOTE: There are two aspects that need to be remembered. The first is the data manipulation that gets the data ready to be processed by the model; the second is the import of the model itself.

Also, there is the model itself and the vocabulary that accompanies it. When the model is created, a vocabulary is also created from the training data. If the model and the vocabulary have the same name, they are saved under that name in one file. If they have different names, then both files must be saved and used together in the same way as when the model was created.
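As a minimal sketch of the two-file case, the following fits a small CountVectorizer and MultinomialNB, pickles each to its own file, then loads both and applies them to new text. The training sentences, labels, and file names are made up for illustration and are not the notebook's data; it assumes scikit-learn is installed.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set (1 = bullish, 0 = not bullish).
train_text = ["strong buy signal", "great earnings beat",
              "weak guidance sell", "price drop sell"]
train_y = [1, 1, 0, 0]

# Fitting the vectorizer creates the vocabulary from the training data.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_text)
model = MultinomialNB().fit(X_train, train_y)

# Hypothetical file names; the model and the vocabulary go to separate files.
workdir = tempfile.mkdtemp()
model_path = os.path.join(workdir, "demo Model.sav")
vocab_path = os.path.join(workdir, "demo Vocab.sav")

with open(model_path, "wb") as f:
    pickle.dump(model, f)
with open(vocab_path, "wb") as f:
    pickle.dump(vectorizer, f)

# Later (or in another notebook): load both files and use them together.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)
with open(vocab_path, "rb") as f:
    loaded_vectorizer = pickle.load(f)

new_text = ["buy signal", "sell now"]
X_new = loaded_vectorizer.transform(new_text)  # transform only, no refit
predictions = loaded_model.predict(X_new)

print(X_train.shape[1], X_new.shape[1])  # same number of columns
print(predictions)
```

Because the loaded vectorizer only transforms the new text, the new matrix has the same columns as the training matrix, which is what the loaded model expects.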

https://towardsdatascience.com/nlp-classification-in-python-pycaret-approach-vs-the-traditional-approach-602d38d29f06

https://github.com/prateek025/SMS_Spam_Ham/blob/master/SMS_Spam_Ham_Raw.csv

https://github.com/prateek025/SMS_Spam_Ham/blob/master/Spam-Ham.ipynb

https://pycaret.org/tune-model/

First-time users will most likely need to install several different libraries as well as update/upgrade some of the libraries they already have.

If you upgraded numpy before installing pycaret, the pycaret installation removes the latest version of numpy and references the previous version. To get around this, upgrade numpy after the pycaret library install.

You might have to install the pandas-profiling library. To do so run the following:
pip install pandas-profiling --user

To find the library version:
print(scipy.__version__)

To upgrade to the latest version:
pip install scipy --upgrade --user

To install pycaret:
pip install pycaret --user

To install spacy (for pycaret):
pip install --user https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz

Correct the errors listed as a result of loading the pycaret library. Most of them will be fixed by either loading the latest version or by installing the missing library.
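To see at a glance which of these libraries are installed and at what version, a small check such as the following may help. It is only a sketch: the package list is an assumption based on the libraries named in this notebook, and it uses the standard library's importlib.metadata (Python 3.8 or later).

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as used by pip; adjust the list as needed.
packages = ["numpy", "scipy", "pandas", "scikit-learn", "pycaret",
            "pandas-profiling", "spacy", "nltk"]

for name in packages:
    print(name, installed_version(name) or "not installed")
```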


Revision History:

Model Import Test 0.0 - 1.1

Imports the NBC model and runs the model on a data set without having to create a new vocabulary list and model.
0.1 - Throws the following error when a different data set is applied to a previously saved model:
   "ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 202 is different from 213)"
   - Throws an error if the data set to be analyzed has more columns than the data set used to create the model. This "could" mean that the original model did not have as many words. To test this theory, we will create the model with the larger vocabulary and use it on the data set with the smaller vocabulary.
   - Adds method to pick which model and which data set to be evaluated. (Done)
   - Adds Gradient model (Not done)
   - Adds accuracy predictions (Not Done)
0.2 - runs the model on a single record - didn't work
0.3 - adjusted the columns by adjusting the stopwords - worked
0.4 - tried inputting a string into the .predict function - didn't work
0.5 - stripped out all of the other models and left the NB Multinomial classifier; used vector.vocabulary; changed the vectorizer; vectorized only the body column instead of all of them - worked **** THIS STEP WAS KEY ****
0.6 - added changes in 0.5 to a full version with all models - worked
0.7 - will try to remove all columns except the body and sentiment (independent and dependent variables) essentially did this by vectorizing only the body column in the model creation and in the vectorizing of the new data set.
0.8 - outputs model to a .sav file.
0.9 - The original example only had one independent variable and one dependent variable. The data set that I was importing had multiple and was making all of them dependent variables. This was causing the dimensional problems. Need to export the model with only one independent variable and one dependent variable, read it in where the model is not generated and is only read in and then apply it to a new data set and see if it runs. - worked/successful
1.0 - preprocesses the new data in the same way as the data was that created the model; joins the predicted outcomes with the original file to perform accuracy checks (completed)
1.1 - Attempts to import optimized MultinomialNB model; attempts to import optimized RandomForestClassifier model

1.2 - add the RandomForestClassifier Model
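The dimension error in 0.1 (size 202 versus 213) and the fix in 0.5 are consistent with the following picture: a model expects exactly as many columns as the training vocabulary has words, so new text has to be counted against the saved vocabulary rather than against a vocabulary rebuilt from the new data. The sketch below illustrates this with a hand-rolled bag-of-words counter standing in for CountVectorizer; the function names are hypothetical and the sentences are borrowed from the sample rows in this notebook.

```python
def fit_vocabulary(docs):
    """Map every distinct word in docs to a column index (alphabetical order)."""
    words = sorted({w for d in docs for w in d.split()})
    return {w: i for i, w in enumerate(words)}


def transform(docs, vocabulary):
    """Count words per document; words outside the vocabulary are ignored."""
    rows = []
    for d in docs:
        row = [0] * len(vocabulary)
        for w in d.split():
            idx = vocabulary.get(w)
            if idx is not None:
                row[idx] += 1
        rows.append(row)
    return rows


train_docs = ["big trade share", "huge print size"]
new_docs = ["large print size price time 1601 amount"]

train_vocab = fit_vocabulary(train_docs)  # 6 words -> model expects 6 columns
refit_vocab = fit_vocabulary(new_docs)    # 7 words -> 7 columns, mismatch

reused = transform(new_docs, train_vocab)
refit = transform(new_docs, refit_vocab)

print(len(train_vocab), len(reused[0]), len(refit[0]))  # 6 6 7
print(reused[0])  # [0, 0, 1, 0, 1, 0]
```

Only "print" and "size" are known to the training vocabulary, so the reused row has two non-zero counts and keeps the six-column shape; the refit row has seven columns and could not be fed to a model trained on six.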

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas_profiling
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
import pickle
import os
import time
import re

In [3]:
# METHODS

def getData(name):
    df1 = pd.read_csv(name, header = 0) # reads the csv file into a dataframe
    return df1

# displays a list of files with a given suffix/extension
def list_dir_files(relevant_path, exten):
    # https://clay-atlas.com/us/blog/2019/10/27/python-english-tutorial-solved-unicodeescape-error-escape-syntaxerror/?doing_wp_cron=1618286551.1528689861297607421875
    # in Windows paths, change \ to / to avoid the unicodeescape SyntaxError

    # uses os.listdir to display only files that end with the given extension
    import os
    
    included_extensions = [exten]
    file_names = [fn for fn in os.listdir(relevant_path)
              if any(fn.endswith(ext) for ext in included_extensions)]

    print('Path: ', relevant_path)

    for f in file_names:
        print(f)
        
# DATA PREPARATION 1 - cleans up the dataframe: drops duplicate records; removes duplicate headers; removes unnecessary columns; renames body column;
#separates the sentiment column into Bullish, None, Bearish; copies df1 to df2

def DataPrep(df1):
    
    df1 = df1.drop_duplicates() # removes duplicate records
    len(df1)

    column = 'symbol'
    df1.drop(df1[df1['symbol'] == column].index, inplace=True) #removes duplicate headers

    df1 = df1.reset_index(drop = True) # resets the index

    #Note: symbol and created_at will be needed for app
    df1.drop(['symbol', 'created_at', 'followers'], inplace = True, axis = 1) #deletes columns that are not needed for creating the model
    print(len(df1))

    df1.rename(columns = {'body' : 'body_original'}, inplace = True) #renames body as body_original
    display(df1.head())
    df1 = pd.get_dummies(df1, columns = ['sentiment'], drop_first = False) # drop_first is false to get all three possibilities (column sentiments)

    df2 = df1

    df2 = df2[['sentiment_Bullish', 'sentiment_None' , 'sentiment_Bearish', 'body_original']] #reorders the columns

    print(df2.columns)

    print('Data Prep 1 completed. (df clean up; drop duplicates records, headers, certain rows; reorders columns) \n')

    # DATA PREPARATION 2 - cleaning up the comments/tweets: Remove HTTP tags, Converts all to lower case, Removes punctuation;
    #Removes unicodes

    import re
    from bs4 import BeautifulSoup

    # Remove @mentions, URLs, and any character that is not a letter, digit, space, or tab
    %time df2['body_Processed'] = df2['body_original'].map(lambda x : ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split()))
    print('HTTP tags removed. \n')

    #Lower Case
    %time df2['body_Processed'] = df2['body_Processed'].map(lambda x: x.lower())
    print('Converted to lower case. \n')

    #Remove punctuations
    %time df2['body_Processed'] = df2['body_Processed'].map(lambda x: re.sub(r'[^\w\s]', '', x))
    print('Removed punctuatuations. \n')

    #Remove unicodes
    %time df2['body_Processed'] = df2['body_Processed'].map(lambda x : re.sub(r'[^\x00-\x7F]+',' ', x))

    df2.head()
    print('Removed unicodes. \n')

    print('Data Prep 2 completed. (Removal of HTTP tags, punctuation, unicodes; lower case conversion; \n')

    # DATA PREPARATION 3 - lemmatization/stopwords: Removes stopwords, Lemmatizes the words, Removes stopwords again after lemmatization.

    from nltk.stem import WordNetLemmatizer
    from nltk.corpus import stopwords

    # Remove stopwords
    stop_words = stopwords.words('english')

    #adds new stopwords to list
    new_stop_words = ['intc', 'nvda', 'tsla', 'mu', 'msft', 'tsm', 'adbe', 'unh', '39', ' 270',
                  '270000', '4033477', '244', '16', '399', '800', '270', '000', '60', '74',
                 '1600', '993', '392', '98', '00', '1601', 'amd', 'aapl', '03', '10', '100',
                  '15', '18', '19', '20', '2021', '57', '0', '5', '11', 'qcom', 'hon', 'ibm',
                 'intel', '05', '12', '13', '14', '17', '21', '22', '30', '50', 'intel']

    '''new_stop_words = ['intc', 'nvda', 'tsla', 'mu', 'msft', 'tsm', 'adbe', '00 call', '00 call entry',]
    '''

    '''new_stop_words = ['intc', 'nvda', 'tsla', 'mu', 'msft', 'tsm', 'adbe', 'unh', '39', ' 270',
                  '270000', '4033477', '244', '16', '399', '800', '270', '000', '60', '74',
                 '1600', '993', '392', '98', '00', 'amd', 'aapl']'''

    for w in new_stop_words:
        stop_words.append(w)

    #print(stop_words)

    #removes the stopwords from the column body_Processed
    %time df2['body_Processed'] = df2['body_Processed'].map(lambda x : ' '.join([w for w in x.split() if w not in stop_words]))
    df2.head()

    df = df2
    # Lemmatize the text
    lemmer = WordNetLemmatizer()

    import nltk #not in original code
    nltk.download('wordnet') #not in original code

    %time df2['body_Processed'] = df2['body_Processed'].map(lambda x : ' '.join([lemmer.lemmatize(w) for w in x.split() if w not in stop_words]))
    df2.head()

    #Removing Stop words again after Lemmatize
    %time df2['body_Processed'] = df2['body_Processed'].map(lambda x : ' '.join([w for w in x.split() if w not in stop_words]))
    display(df2.head())
    print('Data Prep #3 completed (removal of stopwords and lemmatization) ...\n')
    
    return df2

#function to prepare Confusion Matrix, RoC-AUC curve, and relevant statistics
def clf_report(Y_test, Y_pred, probs):
    print("\n", "Confusion Matrix")
    cm = confusion_matrix(Y_test, Y_pred)
    #print("\n", cm, "\n")
    sns.heatmap(cm, square=True, annot=True, cbar=False, fmt = 'g', cmap='RdBu',
                #xticklabels=['ham', 'spam'], yticklabels=['ham', 'spam'])
                #xticklabels=['Bullish', 'Non-bullish'], yticklabels=['Bullish', 'Non-bullish'])
                xticklabels=['None', 'Non-None'], yticklabels=['None', 'Non-None'])

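    plt.xlabel('predicted label') # columns of confusion_matrix are the predicted labels
    plt.ylabel('true label') # rows of confusion_matrix are the true labels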
    plt.show()
    print("\n", "Classification Report", "\n")
    print(classification_report(Y_test, Y_pred))
    print("Overall Accuracy : ", round(accuracy_score(Y_test, Y_pred) * 100, 2))
    print("Precision Score : ", round(precision_score(Y_test, Y_pred, average='binary') * 100, 2))
    print("Recall Score : ", round(recall_score(Y_test, Y_pred, average='binary') * 100, 2))
    preds = probs[:,1] # this is the probability for 1, column 0 has probability for 0. Prob(0) + Prob(1) = 1
    fpr, tpr, threshold = roc_curve(Y_test, preds)
    roc_auc = auc(fpr, tpr)
    print("AUC : ", round(roc_auc * 100, 2), "\n")
    #display(probs)
    #print("Cutoff Probability : ", preds)
    plt.figure()
    plt.plot(fpr, tpr, label='Best Model on Test Data (area = %0.2f)' % roc_auc)
    plt.plot([0.0, 1.0], [0, 1],'r--')
    plt.xlim([-0.1, 1.1])
    plt.ylim([-0.1, 1.1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('RoC-AUC on Test Data')
    plt.legend(loc="lower right")
    plt.savefig('Log_ROC')
    plt.show()
    print("--------------------------------------------------------------------------")

def accuracy(dfPred):
    i = 0
    right_score = 0
    wrong_score = 0
    while i < len(dfPred):
        if dfPred.iloc[i,5] == dfPred.iloc[i,0]:
            right_score += 1
        else:
            wrong_score += 1
        
        i += 1
    
    print('Total Correct: ', right_score, '; Percent Correct: ', int(right_score/len(dfPred) * 1000)/10, '%')
    print('Total Incorrect: ', wrong_score, '; Percent Incorrrect: ', int(wrong_score/len(dfPred) * 1000)/10, '%')
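The text clean-up in DataPrep can be tried on a single string without pandas or NLTK. The sketch below reuses the notebook's regular expression and lower-casing, then drops stop words; lemmatization is left out. The sample message, the shortened stop list, and the unescape_html option are assumptions added for illustration. It also suggests where stop words such as '39' may come from: the HTML entity &#39; (an apostrophe) loses its punctuation and leaves the digits behind.

```python
import html
import re

# Same pattern as the notebook: @mentions, any character that is not a
# letter, digit, space or tab, and URLs are each replaced by a space.
NOISE = re.compile(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)")

# Hypothetical, shortened stop list; the notebook uses NLTK's English
# stop words plus ticker symbols and numbers.
DEMO_STOP_WORDS = {"i", "m", "mu", "39"}


def clean_text(text, stop_words=DEMO_STOP_WORDS, unescape_html=False):
    """Strip noise, lower-case, and remove stop words from one message."""
    if unescape_html:
        text = html.unescape(text)  # optional: turn &#39; back into '
    text = " ".join(NOISE.sub(" ", text).split())
    text = text.lower()
    return " ".join(w for w in text.split() if w not in stop_words)


sample = "@DocOctagon $MU I&#39;m hoping https://t.co/x GREEEN!"

print(clean_text(sample, stop_words=set()))                      # mu i 39 m hoping greeen
print(clean_text(sample))                                        # hoping greeen
print(clean_text(sample, stop_words=set(), unescape_html=True))  # mu i m hoping greeen
```

Unescaping HTML entities first is one possible way to avoid stray numbers like 39; whether it changes the model's accuracy has not been tested here.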
    

In [5]:
# MAIN

from sklearn.metrics import roc_curve, auc, classification_report, confusion_matrix, precision_score, recall_score,  accuracy_score, precision_recall_curve
from sklearn.feature_extraction.text import CountVectorizer

###################
#HERE IS WHERE THE SAVED MODEL AND VOCAB FILES ARE LOADED WITH PICKLE
###################

print("\n", 'Naive Bayes Classifier')

#relevant_path = 'C:/Users/pstri/OneDrive/Documents/Personal/Kokoro/NLTK/Code Project/NLP Models/Model Import Tests'
relevant_path = 'C:/Users/pstri/OneDrive/Documents/Personal/Kokoro/NLTK/Code Project/Sentiment/rfcFinviz'
exten = 'sav'
list_dir_files(relevant_path, exten)

MNB_ModelFileToLoad = input('\nWhat MNB model file do you want to load? \n')
MNB_VocabFileToLoad = input('\nWhat MNB vocabulary file do you want to load? \n')

RFC_ModelFileToLoad = input('\nWhat RFC model file do you want to load? \n')
RFC_VocabFileToLoad = input('\nWhat RFC vocabulary file do you want to load? \n')

'''
#more recent but smaller data set
ModelFileToLoad = '2021-05-22 tech search stocktwitsModel.sav'
VocabFileToLoad = '2021-05-22 tech search stocktwitsVocab.sav'

ModelFileToLoad = 'tech stockTwit 03112021Model.sav'
VocabFileToLoad = 'tech stockTwit 03112021Vocab.sav'

ModelFileToLoad = 'mnb_optimized_5-22 and 6-01 tech search stocktwitsModel.sav'
VocabFileToLoad = 'mnb_optimized_5-22 and 6-01 tech search stocktwitsVocab.sav'
'''##

MNB_loaded_model = pickle.load(open(MNB_ModelFileToLoad, 'rb')) #The saved MNB model is loaded
MNB_loaded_vocab = pickle.load(open(MNB_VocabFileToLoad, 'rb')) #The saved MNB vocab is loaded

RFC_loaded_model = pickle.load(open(RFC_ModelFileToLoad, 'rb')) #The saved RFC model is loaded
RFC_loaded_vocab = pickle.load(open(RFC_VocabFileToLoad, 'rb')) #The saved RFC vocab is loaded

########
# LOADS NEW DATA SET; PROCESSES THE NEW DATA; RUNS THE MODEL ON THE NEW DATA.
########

relevant_path = 'C:/Users/pstri/OneDrive/Documents/Personal/Kokoro/NLTK/Code Project/NLP Models/Model Import Tests'
exten = 'csv'
list_dir_files(relevant_path, exten)

TestFileToLoad = input('What file do you want to perform sentiment analysis on? \n')

#TestFileToLoad = '2021-05-22 tech search stocktwits.csv'
df1 = getData(relevant_path + '/' + TestFileToLoad) #returns df; reads csv file into df

#tf_vectorizer = CountVectorizer(vocabulary=tf_vectorizer.vocabulary_)
#tf_vectorizer = MNB_loaded_vocab
tf_vectorizer = RFC_loaded_vocab



##############################
#Data Preparation
##############################
df1 = df1.drop_duplicates() # removes duplicate records
len(df1)

column = 'symbol'
df1.drop(df1[df1['symbol'] == column].index, inplace=True) #removes duplicate headers

df1 = df1.reset_index(drop = True) # resets the index

#Note: symbol and created_at will be needed for app
df1.drop(['symbol', 'created_at', 'followers'], inplace = True, axis = 1) #deletes columns that are not needed for creating the model
print(len(df1))

df1.rename(columns = {'body' : 'body_original'}, inplace = True) #renames body as body_original
display(df1.head())
df1 = pd.get_dummies(df1, columns = ['sentiment'], drop_first = False) # drop_first is false to get all three possibilities (column sentiments)

df2 = df1

df2 = df2[['sentiment_Bullish', 'sentiment_None' , 'sentiment_Bearish', 'body_original']] #reorders the columns

print(df2.columns)

print('Data Prep 1 completed. (df clean up; drop duplicates records, headers, certain rows; reorders columns) \n')

# DATA PREPARATION 2 - cleaning up the comments/tweets: Remove HTTP tags, Converts all to lower case, Removes punctuation;
#Removes unicodes

import re
from bs4 import BeautifulSoup

# Remove @mentions, URLs, and any character that is not a letter, digit, space, or tab
%time df2['body_Processed'] = df2['body_original'].map(lambda x : ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split()))
print('HTTP tags removed. \n')

#Lower Case
%time df2['body_Processed'] = df2['body_Processed'].map(lambda x: x.lower())
print('Converted to lower case. \n')

#Remove punctuations
%time df2['body_Processed'] = df2['body_Processed'].map(lambda x: re.sub(r'[^\w\s]', '', x))
print('Removed punctuatuations. \n')

#Remove unicodes
%time df2['body_Processed'] = df2['body_Processed'].map(lambda x : re.sub(r'[^\x00-\x7F]+',' ', x))

df2.head()
print('Removed unicodes. \n')

print('Data Prep 2 completed. (Removal of HTTP tags, punctuation, unicodes; lower case conversion; \n')

# DATA PREPARATION 3 - lemmatization/stopwords: Removes stopwords, Lemmatizes the words, Removes stopwords again after lemmatization.

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Remove stopwords
stop_words = stopwords.words('english')

#adds new stopwords to list
new_stop_words = ['intc', 'nvda', 'tsla', 'mu', 'msft', 'tsm', 'adbe', 'unh', '39', ' 270',
     '270000', '4033477', '244', '16', '399', '800', '270', '000', '60', '74',
    '1600', '993', '392', '98', '00', '1601', 'amd', 'aapl', '03', '10', '100',
     '15', '18', '19', '20', '2021', '57', '0', '5', '11', 'qcom', 'hon', 'ibm',
    'intel', '05', '12', '13', '14', '17', '21', '22', '30', '50', 'intel']

'''new_stop_words = ['intc', 'nvda', 'tsla', 'mu', 'msft', 'tsm', 'adbe', '00 call', '00 call entry',]
'''

'''new_stop_words = ['intc', 'nvda', 'tsla', 'mu', 'msft', 'tsm', 'adbe', 'unh', '39', ' 270',
      '270000', '4033477', '244', '16', '399', '800', '270', '000', '60', '74',
     '1600', '993', '392', '98', '00', 'amd', 'aapl']'''

for w in new_stop_words:
    stop_words.append(w)

#print(stop_words)

#removes the stopwords from the column body_Processed
%time df2['body_Processed'] = df2['body_Processed'].map(lambda x : ' '.join([w for w in x.split() if w not in stop_words]))
df2.head()

# df = df2
# Lemmatize the text
lemmer = WordNetLemmatizer()

import nltk #not in original code
nltk.download('wordnet') #not in original code

%time df2['body_Processed'] = df2['body_Processed'].map(lambda x : ' '.join([lemmer.lemmatize(w) for w in x.split() if w not in stop_words]))
df2.head()

#Removing Stop words again after Lemmatize
%time df2['body_Processed'] = df2['body_Processed'].map(lambda x : ' '.join([w for w in x.split() if w not in stop_words]))
display(df2.head())
print('Data Prep #3 completed (removal of stopwords and lemmatization) ...\n')
 
###############################
# APPLIES VOCABULARY AND TOKENIZES THE PREPROCESSED COMMENTS; tf_vectorizer is expected to be the fitted vectorizer, which carries both the vocabulary and the transform method
###############################

X_new = tf_vectorizer.transform(df2['body_Processed']) #converts the strings into vectors for sentiment analysis

###############################
# MNB MODEL
###############################
MNB_Y_pred = MNB_loaded_model.predict(X_new) #Runs the MNB model on the new data set

display('New Y_pred: ', MNB_Y_pred)

df3 = pd.DataFrame(MNB_Y_pred, columns = ['Predicted Bullish Sentiment']) #creates a new df3 from the Y_pred array
MNB_dfPred = df2.join(df3) #joins the df2 and df3 dataframes together

print('The accuracy for the MNB model is: ')
accuracy(MNB_dfPred)

###############################
# RFC MODEL
###############################
RFC_Y_pred = RFC_loaded_model.predict(X_new) #Runs RFC model on the new data set
df4 = pd.DataFrame(RFC_Y_pred, columns = ['Predicted Bullish Sentiment']) #creates a new df4 from the Y-Pred array
RFC_dfPred = df2.join(df4) #joins the df2 and df4 dataframes together

print('The accuracy for the RFC model is: ')
accuracy(RFC_dfPred)

#writes the RFC_dfPred to a csv file
RFC_dfPred.to_csv('RFC_bullish_dfPred.csv', index = False, encoding = 'utf-8')
print('The csv file was written. File name: ', 'RFC_bullish_dfPred.csv')


 Naive Bayes Classifier
Path:  C:/Users/pstri/OneDrive/Documents/Personal/Kokoro/NLTK/Code Project/Sentiment/rfcFinviz
5-22 and 6-01 tech search stocktwits-Copy1 MNB Model-Copy1.sav
5-22 and 6-01 tech search stocktwits-Copy1 MNB Vocab-Copy1.sav
rfc_optimized_5-22 and 6-01 tech search stocktwits-Copy1 Model-Copy1.sav
rfc_optimized_5-22 and 6-01 tech search stocktwits-Copy1 Vocab-Copy1.sav

What MNB model file do you want to load? 
5-22 and 6-01 tech search stocktwits-Copy1 MNB Model-Copy1.sav

What MNB vocabulary file do you want to load? 
5-22 and 6-01 tech search stocktwits-Copy1 MNB Vocab-Copy1.sav

What RFC model file do you want to load? 
rfc_optimized_5-22 and 6-01 tech search stocktwits-Copy1 Model-Copy1.sav

What RFC vocabulary file do you want to load? 
rfc_optimized_5-22 and 6-01 tech search stocktwits-Copy1 Vocab-Copy1.sav
Path:  C:/Users/pstri/OneDrive/Documents/Personal/Kokoro/NLTK/Code Project/NLP Models/Model Import Tests
2021-05-22 tech search stocktwits-Copy1.csv
2021-

Unnamed: 0,messageID,body_original,sentiment,date,time
0,337302427,$AMD $ATNF $HITID $INTC $AMC ..🚀🚀Let&#39;s go ...,,2021-06-01,05:53:17
1,337301864,$AMD $ATNF Our team is up 1253% yesterday so f...,Bullish,2021-06-01,05:48:16
2,337301818,$amd $atnf $intc ..Started trading 5 months ag...,Bullish,2021-06-01,05:47:49
3,337300512,$amd $atnf $hitid $intc $amc ..Started tradin...,,2021-06-01,05:35:24
4,337300175,$SPY $ATNF $AMC $INTC..Mark this post... I’m g...,Bullish,2021-06-01,05:31:49


Index(['sentiment_Bullish', 'sentiment_None', 'sentiment_Bearish',
       'body_original'],
      dtype='object')
Data Prep 1 completed. (df clean up; drop duplicates records, headers, certain rows; reorders columns) 

Wall time: 69.8 ms
HTTP tags removed. 

Wall time: 2 ms
Converted to lower case. 

Wall time: 11 ms
Removed punctuatuations. 

Wall time: 12 ms
Removed unicodes. 

Data Prep 2 completed. (Removal of HTTP tags, punctuation, unicodes; lower case conversion; 

Wall time: 258 ms


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\pstri\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Wall time: 428 ms
Wall time: 197 ms


Unnamed: 0,sentiment_Bullish,sentiment_None,sentiment_Bearish,body_original,body_Processed
0,0,1,0,$AMD $ATNF $HITID $INTC $AMC ..🚀🚀Let&#39;s go ...,atnf hitid amc let go big chat tradethemomentu...
1,1,0,0,$AMD $ATNF Our team is up 1253% yesterday so f...,atnf team 1253 yesterday far small cap play gr...
2,1,0,0,$amd $atnf $intc ..Started trading 5 months ag...,atnf started trading month ago 4k made profit ...
3,0,1,0,$amd $atnf $hitid $intc $amc ..Started tradin...,atnf hitid amc started trading 4 month ago 3k ...
4,1,0,0,$SPY $ATNF $AMC $INTC..Mark this post... I’m g...,spy atnf amc mark post going follower end summ...


Data Prep #3 completed (removal of stopwords and lemmatization) ...



AttributeError: 'MultinomialNB' object has no attribute 'transform'
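The AttributeError above indicates that the object on which .transform was called was a MultinomialNB estimator rather than a vectorizer. One possible cause, offered only as an assumption, is that a model file was chosen where a vocabulary file was expected. A guard like the sketch below would report this right after loading instead of after all the data preparation has run. The function name and the two stand-in classes are hypothetical; with the real files, the objects returned by pickle.load would be passed in instead.

```python
def require_vectorizer(obj, source="<unknown file>"):
    """Return obj if it has a callable .transform, else raise a clear error."""
    if not callable(getattr(obj, "transform", None)):
        raise TypeError(
            f"{source} holds a {type(obj).__name__}, which has no .transform(); "
            "it looks like a model file was loaded in place of a vocabulary file."
        )
    return obj


# Stand-ins for illustration only.
class StandInVectorizer:
    def transform(self, docs):
        return [[len(d.split())] for d in docs]


class StandInModel:
    def predict(self, X):
        return [0 for _ in X]


# A vectorizer-like object passes the check.
vec = require_vectorizer(StandInVectorizer(), "demo Vocab.sav")

# A model-like object is rejected with a readable message.
try:
    require_vectorizer(StandInModel(), "demo Model.sav")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```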

In [5]:
display(RFC_dfPred)

Unnamed: 0,sentiment_Bullish,sentiment_None,sentiment_Bearish,body_original,body_Processed,Predicted Bullish Sentiment
0,0,1,0,$AMD $ATNF $HITID $INTC $AMC ..🚀🚀Let&#39;s go ...,atnf hitid amc let go big chat tradethemomentu...,0
1,1,0,0,$AMD $ATNF Our team is up 1253% yesterday so f...,atnf team 1253 yesterday far small cap play gr...,0
2,1,0,0,$amd $atnf $intc ..Started trading 5 months ag...,atnf started trading month ago 4k made profit ...,0
3,0,1,0,$amd $atnf $hitid $intc $amc ..Started tradin...,atnf hitid amc started trading 4 month ago 3k ...,0
4,1,0,0,$SPY $ATNF $AMC $INTC..Mark this post... I’m g...,spy atnf amc mark post going follower end summ...,0
...,...,...,...,...,...,...
4615,0,1,0,$MU WDC has outperformed MU by 10% over 2 days...,wdc outperformed 2 day two day insane big run ...,0
4616,1,0,0,$MU GREEEN BABY LET&#39;S GO!,greeen baby let go,0
4617,0,1,0,$MU we’re outperforming the SMH? Never thought...,outperforming smh never thought see day,0
4618,1,0,0,@DocOctagon $MU I&#39;m hoping that with Zinsn...,hoping zinsner cfo barclays wednesday morning ...,0


In [14]:
# MAIN NOT OPTIMIZED MODEL

from sklearn.metrics import roc_curve, auc, classification_report, confusion_matrix, precision_score, recall_score,  accuracy_score, precision_recall_curve



#HERE IS WHERE THE SAVED MODEL AND VOCAB FILES ARE LOADED WITH PICKLE AND RUN
print("\n", 'Naive Bayes Classifier')

'''
#more recent but smaller data set
ModelFileToLoad = '2021-05-22 tech search stocktwitsModel.sav'
VocabFileToLoad = '2021-05-22 tech search stocktwitsVocab.sav'
'''
ModelFileToLoad = 'tech stockTwit 03112021Model.sav'
VocabFileToLoad = 'tech stockTwit 03112021Vocab.sav'

loaded_model = pickle.load(open(ModelFileToLoad, 'rb')) #The saved model is loaded
loaded_vocab = pickle.load(open(VocabFileToLoad, 'rb')) #The saved vocab is loaded

########
# LOADS NEW DATA SET; PROCESSES THE NEW DATA; RUNS THE MODEL ON THE NEW DATA.
########

relevant_path = 'C:/Users/pstri/OneDrive/Documents/Personal/Kokoro/NLTK/Code Project/NLP Models/Model Import Tests'
df3 = getData(relevant_path + '/' + '2021-05-22 tech search stocktwits.csv') #returns df; reads csv file into df

#tf_vectorizer = CountVectorizer(vocabulary=tf_vectorizer.vocabulary_)
tf_vectorizer = loaded_vocab

df3 = DataPrep(df3)

X_new = tf_vectorizer.transform(df3['body_Processed'])
   
Y_pred = loaded_model.predict(X_new) #Runs the model on the new data set

display('New Y_pred: ', Y_pred)

df4 = pd.DataFrame(Y_pred, columns=['Predicted Sentiment']) #creates a new df4 from the Y_pred array
dfPred = df3.join(df4) #joins the df3 and df4 dataframes together

accuracy(dfPred)



 Naive Bayes Classifier
1468


Unnamed: 0,messageID,body_original,sentiment,date,time
0,333823830,$INTC Simulated 57.0 dollar CALLS for Monday...,Bullish,2021-05-22,16:18:23
1,333815936,$AMD $MU $INTC $QCOM $NVDA ..Nvda splitting......,,2021-05-22,15:32:39
2,333801174,5 of 11 $HON $IBM $INTC Dark green arrows indi...,,2021-05-22,13:54:06
3,333794667,$INTC Timing the market with short dated opti...,Bullish,2021-05-22,13:06:41
4,333790756,Your daily News digest for Intel $INTC https:/...,,2021-05-22,12:33:05


Index(['sentiment_Bullish', 'sentiment_None', 'sentiment_Bearish',
       'body_original'],
      dtype='object')
Data Prep 1 completed. (df clean up; drop duplicates records, headers, certain rows; reorders columns) 

Wall time: 25 ms
HTTP tags removed. 

Wall time: 1.01 ms
Converted to lower case. 

Wall time: 3.99 ms
Removed punctuatuations. 

Wall time: 5.98 ms
Removed unicodes. 

Data Prep 2 completed. (Removal of HTTP tags, punctuation, unicodes; lower case conversion; 

Wall time: 87.8 ms
Wall time: 148 ms
Wall time: 65.2 ms


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\pstri\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,sentiment_Bullish,sentiment_None,sentiment_Bearish,body_original,body_Processed
0,1,0,0,$INTC Simulated 57.0 dollar CALLS for Monday...,simulated dollar call monday open stockorbit
1,0,1,0,$AMD $MU $INTC $QCOM $NVDA ..Nvda splitting......,splitting much know debt ceiling impact overal...
2,0,1,0,5 of 11 $HON $IBM $INTC Dark green arrows indi...,dark green arrow indicate strong buy signal da...
3,1,0,0,$INTC Timing the market with short dated opti...,timing market short dated option burn account ...
4,0,1,0,Your daily News digest for Intel $INTC https:/...,daily news digest


Data Prep #3 completed (removal of stopwords and lemmatization) ...



'New Y_pred: '

array([0, 1, 0, ..., 1, 1, 0], dtype=uint8)

Total Correct:  885 ; Percent Correct:  60.2 %
Total Incorrect:  583 ; Percent Incorrrect:  39.7 %


In [75]:
print(len(Y_pred))

1468


In [20]:
print(len(df3))

2417


In [21]:
df4 = pd.DataFrame(Y_pred, columns=['Predicted Sentiment'])

In [22]:
display(df4.head())

Unnamed: 0,Predicted Sentiment
0,1
1,1
2,1
3,0
4,0


In [29]:

dfPred = df3.join(df4)

display(dfPred.head())

Unnamed: 0,sentiment_Bullish,sentiment_None,sentiment_Bearish,body_original,body_Processed,Predicted Sentiment
0,0,1,0,$INTC Big Trade - $16 399 800.270 000 shares a...,big trade share,1
1,0,1,0,Large Print $INTC Size: 270000 Price: 60.74 Ti...,large print size price time 1601 amount,1
2,0,1,0,Huge Print $INTC Size: 4033477 Price: 60.74 Ti...,huge print size price time amount,1
3,1,0,0,$AMD common follow ur sibs $INTC $MU,common follow ur sib,0
4,1,0,0,$ITT $INTC $ADBE $OPTT $GLBS . .,itt optt glbs,0


In [31]:
print(dfPred.columns)

Index(['sentiment_Bullish', 'sentiment_None', 'sentiment_Bearish',
       'body_original', 'body_Processed', 'Predicted Sentiment'],
      dtype='object')


In [48]:
i = 0
right_score = 0
wrong_score = 0
while i < len(dfPred):
    if dfPred.iloc[i,5] == dfPred.iloc[i,0]:
        right_score += 1
    else:
        wrong_score += 1
        
    i += 1
    
print('Total Correct: ', right_score, '; Percent Correct: ', int(right_score/len(dfPred) * 1000)/10, '%')
print('Total Incorrect: ', wrong_score, '; Percent Incorrrect: ', int(wrong_score/len(dfPred) * 1000)/10, '%')

Total Correct:  1347 ; Percent Correct:  55.7 %
Total Incorrect:  1070 ; Percent Incorrrect:  44.2 %
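The two percentages above add up to 99.9 rather than 100 because int() truncates instead of rounding. The sketch below compares the notebook's formula with round() on the same counts; the helper names are hypothetical.

```python
def pct_truncated(count, total):
    """The notebook's formula: keeps one decimal place by truncating."""
    return int(count / total * 1000) / 10


def pct_rounded(count, total):
    """Alternative: rounds to one decimal place."""
    return round(count / total * 100, 1)


right, wrong = 1347, 1070
total = right + wrong  # 2417 rows

print(pct_truncated(right, total), pct_truncated(wrong, total))  # 55.7 44.2
print(pct_rounded(right, total), pct_rounded(wrong, total))      # 55.7 44.3
```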


In [13]:
# MAIN

#relevant_path = 'C:/Users/pstri/OneDrive/Documents/Personal/Kokoro/NLTK/Code Project/Scraped Files'
relevant_path = 'C:/Users/pstri/OneDrive/Documents/Personal/Kokoro/NLTK/Code Project/NLP Models/Model Import Tests'
#df1 = pd.read_csv('tech stockTwit 03112021-Copy1.csv') #uses data from stocktwits to train the models
#df1 = pd.read_csv('2021-05-22 tech search stocktwits-Copy1.csv') #uses data from stocktwits to train the models

filename = '2021-05-22 tech search stocktwits.csv'

df1 = getData(relevant_path + '/' + filename) #returns df; reads csv file into df

print('Imported the csv file.')

print(df1.columns)

from sklearn.metrics import roc_curve, auc, classification_report, confusion_matrix, precision_score, recall_score,  accuracy_score, precision_recall_curve

#HERE IS WHERE THE SAVED MODEL IS LOADED WITH PICKLE AND RUN
print("\n", 'Naive Bayes Classifier')

ModelFileToLoad = '2021-05-22 tech search stocktwitsModel.sav'
VocabFileToLoad = '2021-05-22 tech search stocktwitsVocab.sav'

loaded_model = pickle.load(open(ModelFileToLoad, 'rb')) #The saved model is loaded
loaded_vocab = pickle.load(open(VocabFileToLoad, 'rb')) #The saved vocab is loaded

########
# LOADS NEW DATA SET; PROCESSES THE NEW DATA; RUNS THE MODEL ON THE NEW DATA.
########

relevant_path = 'C:/Users/pstri/OneDrive/Documents/Personal/Kokoro/NLTK/Code Project/NLP Models/Model Import Tests'
df3 = getData(relevant_path + '/' + 'tech stockTwit 03112021.csv') #returns df; reads csv file into df

#tf_vectorizer = CountVectorizer(vocabulary=tf_vectorizer.vocabulary_)
tf_vectorizer = loaded_vocab

df3 = DataPrep(df3)

X_new = tf_vectorizer.transform(df3['body_Processed'])
   
Y_pred = loaded_model.predict(X_new) #Runs the model on the new data set

display('New Y_pred: ', Y_pred)


Imported the csv file.
Index(['symbol', 'messageID ', 'created_at', 'body', 'followers', 'sentiment',
       'date', 'time'],
      dtype='object')

 Naive Bayes Classifier
2417


Unnamed: 0,body_original,sentiment
0,$INTC Big Trade - $16 399 800.270 000 shares a...,
1,Large Print $INTC Size: 270000 Price: 60.74 Ti...,
2,Huge Print $INTC Size: 4033477 Price: 60.74 Ti...,
3,$AMD common follow ur sibs $INTC $MU,Bullish
4,$ITT $INTC $ADBE $OPTT $GLBS . .,Bullish


Index(['sentiment_Bullish', 'sentiment_None', 'sentiment_Bearish',
       'body_original'],
      dtype='object')
Data Prep 1 completed. (df clean up; drop duplicates records, headers, certain rows; reorders columns) 

Wall time: 35.9 ms
HTTP tags removed. 

Wall time: 995 µs
Converted to lower case. 

Wall time: 6.99 ms
Removed punctuatuations. 

Wall time: 6.01 ms
Removed unicodes. 

Data Prep 2 completed. (Removal of HTTP tags, punctuation, unicodes; lower case conversion; 

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the',

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\pstri\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Wall time: 205 ms
Wall time: 85.2 ms


Unnamed: 0,sentiment_Bullish,sentiment_None,sentiment_Bearish,body_original,body_Processed
0,0,1,0,$INTC Big Trade - $16 399 800.270 000 shares a...,big trade share
1,0,1,0,Large Print $INTC Size: 270000 Price: 60.74 Ti...,large print size price time 1601 amount
2,0,1,0,Huge Print $INTC Size: 4033477 Price: 60.74 Ti...,huge print size price time amount
3,1,0,0,$AMD common follow ur sibs $INTC $MU,common follow ur sib
4,1,0,0,$ITT $INTC $ADBE $OPTT $GLBS . .,itt optt glbs


Data Prep #3 completed (removal of stopwords and lemmatization) ...



'New Y_pred: '

array([1, 1, 1, ..., 1, 0, 1], dtype=uint8)
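
The round trip this cell depends on can be sketched in isolation: fit a CountVectorizer and a MultinomialNB together, pickle both, reload them, and call transform (not fit_transform) on unseen text so the feature columns match the ones the model was trained on. The toy messages, labels, and file names below are made up for illustration and are not taken from the notebook's data.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (hypothetical); 1 = bullish, 0 = not bullish
train_text = ["buy calls strong breakout", "great earnings buy more",
              "sell now weak guidance", "weak chart sell puts"]
train_y = [1, 1, 0, 0]

# Fit the vectorizer and the model together
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_text)
model = MultinomialNB(alpha=1.0).fit(X_train, train_y)

# Save both objects; the model is only usable with the vectorizer it was trained with
tmp_dir = tempfile.mkdtemp()
model_path = os.path.join(tmp_dir, "toyModel.sav")   # hypothetical file name
vocab_path = os.path.join(tmp_dir, "toyVocab.sav")   # hypothetical file name
with open(model_path, "wb") as f:
    pickle.dump(model, f)
with open(vocab_path, "wb") as f:
    pickle.dump(vectorizer, f)

# Reload both and apply them to new text
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)
with open(vocab_path, "rb") as f:
    loaded_vocab = pickle.load(f)

new_text = ["strong earnings buy", "weak guidance sell"]
X_new = loaded_vocab.transform(new_text)   # transform only, no refitting
Y_pred = loaded_model.predict(X_new)
print(X_train.shape, X_new.shape, Y_pred)
```

Refitting a vectorizer on the new data would generally produce a different number of columns, which is the dimensional mismatch the saved vocabulary avoids.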

In [8]:
# Data Preparation for Sentiment Analysis

#relevant_path = 'C:/Users/pstri/OneDrive/Documents/Personal/Kokoro/NLTK/Code Project/Scraped Files'
relevant_path = 'C:/Users/pstri/OneDrive/Documents/Personal/Kokoro/NLTK/Code Project/NLP Models/Model Import Tests'
#df1 = pd.read_csv('tech stockTwit 03112021-Copy1.csv') #uses data from stocktwits to train the models
#df1 = pd.read_csv('2021-05-22 tech search stocktwits-Copy1.csv') #uses data from stocktwits to train the models

'''print('Here is a list of the csv files to choose from: \n')
exten = 'csv'
list_dir_files(relevant_path, exten)

time.sleep(1)

name = input('\nWhat file do you want to use? ')
df1 = getData(relevant_path + '/' + name) #returns df; reads csv file into df'''

filename = '2021-05-22 tech search stocktwits.csv'

df1 = getData(relevant_path + '/' + filename) #returns df; reads csv file into df

print('Imported the csv file.')

print(df1.columns)

# DATA PREPARATION 1 - cleans up the dataframe: drops duplicate records; removes duplicate headers; removes unnecessary columns; renames body column;
#separates the sentiment column into Bullish, None, Bearish; copies df1 to df2

df1 = df1.drop_duplicates() # removes duplicate records
len(df1)

column = 'symbol'
df1.drop(df1[df1['symbol'] == column].index, inplace=True) #removes duplicate headers

df1 = df1.reset_index(drop = True) # resets the index

#Note: symbol and created_at will be needed for the app
df1.drop(['symbol', 'created_at', 'followers'], inplace = True, axis = 1) #deletes columns that are not needed for creating the model
print(len(df1))

df1.rename(columns = {'body' : 'body_original'}, inplace = True) #renames body as body_original
display(df1.head())
df1 = pd.get_dummies(df1, columns = ['sentiment'], drop_first = False) # drop_first is false to get all three possibilities (column sentiments)

df2 = df1

df2 = df2[['sentiment_Bullish', 'sentiment_None' , 'sentiment_Bearish', 'body_original']].copy() #reorders the columns; .copy() avoids a SettingWithCopyWarning when body_Processed is added below

print(df2.columns)

print('Data Prep 1 completed. \n')

# DATA PREPARATION 2 - cleaning up the comments/tweets: Remove HTTP tags, Converts all to lower case, Removes punctuation;
#Removes unicodes

import re
from bs4 import BeautifulSoup

# Remove @mentions, URLs, and any character that is not a letter, digit, space or tab
%time df2['body_Processed'] = df2['body_original'].map(lambda x : ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)"," ",x).split()))
print('HTTP tags removed. \n')

#Lower Case
%time df2['body_Processed'] = df2['body_Processed'].map(lambda x: x.lower())
print('Converted to lower case. \n')

#Remove punctuations
%time df2['body_Processed'] = df2['body_Processed'].map(lambda x: re.sub(r'[^\w\s]', '', x))
print('Removed punctuatuations. \n')

#Remove unicodes
%time df2['body_Processed'] = df2['body_Processed'].map(lambda x : re.sub(r'[^\x00-\x7F]+',' ', x))

df2.head()
print('Removed unicodes. \n')

print('Data Prep 2 completed. \n')

# DATA PREPARATION 3 - lemmatization/stopwords: Removes stopwords, Lemmatizes the words, Removes stopwords again after lemmatization.

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Remove stopwords
stop_words = stopwords.words('english')

#adds new stopwords to list (NOTE: new_stop_words is assigned three times below, so only the last list takes effect)
new_stop_words = ['intc', 'nvda', 'tsla', 'mu', 'msft', 'tsm', 'adbe', 'unh', '39', ' 270',
                  '270000', '4033477', '244', '16', '399', '800', '270', '000', '60', '74',
                 '1600', '993', '392', '98', '00', '1601', 'amd', 'aapl', '03', '10', '100',
                  '15', '18', '19', '20', '2021', '57', '0', '5', '11', 'qcom', 'hon', 'ibm',
                 'intel', '05', '12', '13', '14', '17', '21', '22', '30', '50']

new_stop_words = ['intc', 'nvda', 'tsla', 'mu', 'msft', 'tsm', 'adbe', '00 call', '00 call entry',]


new_stop_words = ['intc', 'nvda', 'tsla', 'mu', 'msft', 'tsm', 'adbe', 'unh', '39', ' 270',
                  '270000', '4033477', '244', '16', '399', '800', '270', '000', '60', '74',
                 '1600', '993', '392', '98', '00', 'amd', 'aapl']


for w in new_stop_words:
    stop_words.append(w)

print(stop_words)

#removes the stopwords from the column body_Processed
%time df2['body_Processed'] = df2['body_Processed'].map(lambda x : ' '.join([w for w in x.split() if w not in stop_words]))
df2.head()

df = df2
# Lemmatize the text
lemmer = WordNetLemmatizer()

import nltk #not in original code
nltk.download('wordnet') #not in original code

%time df2['body_Processed'] = df2['body_Processed'].map(lambda x : ' '.join([lemmer.lemmatize(w) for w in x.split() if w not in stop_words]))
df2.head()

#Removing Stop words again after Lemmatize
%time df2['body_Processed'] = df2['body_Processed'].map(lambda x : ' '.join([w for w in x.split() if w not in stop_words]))
display(df2.head())

# Embedding on the processed text data
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

# BOW-TF Embedding

no_features = 800
tf_vectorizer = CountVectorizer(min_df=.015, max_df=.8, max_features=no_features, ngram_range=(1, 3)) #ngram_range is given as a tuple, which newer scikit-learn versions require

'''%time tpl_tf = tf_vectorizer.fit_transform(df2['body_Processed'])
display("Bow-TF :", tpl_tf.shape)
df_tf = pd.DataFrame(tpl_tf.toarray(), columns=tf_vectorizer.get_feature_names())
display(df_tf.head())'''
'''
#Preparing processed and BoW-TF embedded data for Classification
df_tf_m = pd.concat([df2, df_tf], axis = 1)
df_tf_m.drop(columns=['body_original', 'body_Processed'], inplace = True)
print(df_tf_m.shape)
display(df_tf_m.head())

# BoW-TF:IDF Embedding
tfidf_vectorizer = TfidfVectorizer(min_df=.02, max_df=.7, ngram_range=[1,3])

%time tpl_tfidf = tfidf_vectorizer.fit_transform(df2['body_Processed'])
display("Bow-TF:IDF :", tpl_tfidf.shape)
df_tfidf = pd.DataFrame(tpl_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names(), index=df2.index)
display(df_tfidf.head())

#Preparing processed and BoW-TF:IDF embedded data for Classification
df_tfidf_m = pd.concat([df2, df_tfidf], axis = 1)
df_tfidf_m.drop(columns=['body_original', 'body_Processed'], inplace = True)
print(df_tfidf_m.shape)
display(df_tfidf_m.head())'''

'''from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier'''
'''from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split, cross_val_score, KFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder'''

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_curve, auc, classification_report, confusion_matrix, precision_score, recall_score,  accuracy_score, precision_recall_curve

'''# USED TO SET UP THE TRAINING AND TESTING DATA SETS (USED IN MODEL CREATION)
df = df_tf_m
      
Y = df['sentiment_Bullish']
X = df.drop('sentiment_Bullish', axis = 1)
    
display('Y: ', Y)
display('X: ', X)
    
#X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size = 0.85, random_state = 21)
X_train, X_test, Y_train, Y_test = train_test_split(tpl_tf, Y, train_size = 0.85, random_state = 21)#NOTE: changing X to the vectorized body column corrects the dimensional mis-match.
#X_train, X_test, Y_train, Y_test = train_test_split(tpl_tfidf, Y, train_size = 0.85, random_state = 21)#NOTE: changing X to the vectorized body column corrects the dimensional mis-match.

print("Train Data Dimensions : ", X_train.shape)
print("Test Data Dimensions : ", X_test.shape)'''
      
#HERE IS WHERE THE SAVED MODEL IS LOADED WITH PICKLE AND RUN
print("\n", 'Naive Bayes Classifier')

'''clf = MultinomialNB(alpha = 1.0)
%time clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
probs = clf.predict_proba(X_test)
clf_report(Y_test, Y_pred, probs)'''

ModelFileToLoad = '2021-05-22 tech search stocktwitsModel.sav'
VocabFileToLoad = '2021-05-22 tech search stocktwitsVocab.sav'

    
with open(ModelFileToLoad, 'rb') as model_file:
    loaded_model = pickle.load(model_file) #The saved model is loaded
with open(VocabFileToLoad, 'rb') as vocab_file:
    loaded_vocab = pickle.load(vocab_file) #The saved vocab (fitted vectorizer) is loaded

########
# LOADS NEW DATA SET; PROCESSES THE NEW DATA; RUNS THE MODEL ON THE NEW DATA.
########

relevant_path = 'C:/Users/pstri/OneDrive/Documents/Personal/Kokoro/NLTK/Code Project/NLP Models/Model Import Tests'
df3 = getData(relevant_path + '/' + 'tech stockTwit 03112021.csv') #returns df; reads csv file into df

print(df3.columns)

#tpl_tf = tf_vectorizer.fit_transform(df2['

#tf_vectorizer = CountVectorizer(vocabulary=tf_vectorizer.vocabulary_)
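tf_vectorizer = loaded_vocab #reuses the fitted vectorizer saved with the model; tpl_tf is not defined here because the fit_transform above is commented out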


df3 = DataPrep(df3)

X_new = tf_vectorizer.transform(df3['body_Processed'])
   
Y_pred = loaded_model.predict(X_new) #Runs the model on the new data set

display('New Y_pred: ', Y_pred)

print(df3.columns)


Imported the csv file.
Index(['symbol', 'messageID ', 'created_at', 'body', 'followers', 'sentiment',
       'date', 'time'],
      dtype='object')
1468


Unnamed: 0,messageID,body_original,sentiment,date,time
0,333823830,$INTC Simulated 57.0 dollar CALLS for Monday...,Bullish,2021-05-22,16:18:23
1,333815936,$AMD $MU $INTC $QCOM $NVDA ..Nvda splitting......,,2021-05-22,15:32:39
2,333801174,5 of 11 $HON $IBM $INTC Dark green arrows indi...,,2021-05-22,13:54:06
3,333794667,$INTC Timing the market with short dated opti...,Bullish,2021-05-22,13:06:41
4,333790756,Your daily News digest for Intel $INTC https:/...,,2021-05-22,12:33:05


Index(['sentiment_Bullish', 'sentiment_None', 'sentiment_Bearish',
       'body_original'],
      dtype='object')
Data Prep 1 completed. 

Wall time: 24.9 ms
HTTP tags removed. 

Wall time: 1.97 ms
Converted to lower case. 

Wall time: 3.99 ms
Removed punctuatuations. 

Wall time: 5 ms
Removed unicodes. 

Data Prep 2 completed. 

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'durin

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\pstri\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,sentiment_Bullish,sentiment_None,sentiment_Bearish,body_original,body_Processed
0,1,0,0,$INTC Simulated 57.0 dollar CALLS for Monday...,simulated 57 0 dollar call monday open stockorbit
1,0,1,0,$AMD $MU $INTC $QCOM $NVDA ..Nvda splitting......,qcom splitting much know debt ceiling impact o...
2,0,1,0,5 of 11 $HON $IBM $INTC Dark green arrows indi...,5 11 hon ibm dark green arrow indicate strong ...
3,1,0,0,$INTC Timing the market with short dated opti...,timing market short dated option burn account ...
4,0,1,0,Your daily News digest for Intel $INTC https:/...,daily news digest intel


Wall time: 103 ms


'Bow-TF :'

(1468, 211)

Unnamed: 0,05,10,100,11,12,13,14,15,15 2021,17,...,unusual,unusual option,unusual option activity,volume,week,weekly,weekly call,worth,would,year
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,2,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0



 Naive Bayes Classifier
Index(['symbol', 'created_at', 'body', 'followers', 'sentiment'], dtype='object')
2417


Unnamed: 0,body_original,sentiment
0,$INTC Big Trade - $16 399 800.270 000 shares a...,
1,Large Print $INTC Size: 270000 Price: 60.74 Ti...,
2,Huge Print $INTC Size: 4033477 Price: 60.74 Ti...,
3,$AMD common follow ur sibs $INTC $MU,Bullish
4,$ITT $INTC $ADBE $OPTT $GLBS . .,Bullish


Index(['sentiment_Bullish', 'sentiment_None', 'sentiment_Bearish',
       'body_original'],
      dtype='object')
Data Prep 1 completed. (df clean up; drop duplicates records, headers, certain rows; reorders columns) 

Wall time: 37.9 ms
HTTP tags removed. 

Wall time: 1.99 ms
Converted to lower case. 

Wall time: 5.01 ms
Removed punctuatuations. 

Wall time: 7.01 ms
Removed unicodes. 

Data Prep 2 completed. (Removal of HTTP tags, punctuation, unicodes; lower case conversion; 

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the'

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\pstri\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Wall time: 92.8 ms


Unnamed: 0,sentiment_Bullish,sentiment_None,sentiment_Bearish,body_original,body_Processed
0,0,1,0,$INTC Big Trade - $16 399 800.270 000 shares a...,big trade share
1,0,1,0,Large Print $INTC Size: 270000 Price: 60.74 Ti...,large print size price time 1601 amount
2,0,1,0,Huge Print $INTC Size: 4033477 Price: 60.74 Ti...,huge print size price time amount
3,1,0,0,$AMD common follow ur sibs $INTC $MU,common follow ur sib
4,1,0,0,$ITT $INTC $ADBE $OPTT $GLBS . .,itt optt glbs


Data Prep #3 completed (removal of stopwords and lemmatization) ...



'New Y_pred: '

array([1, 1, 1, ..., 1, 0, 1], dtype=uint8)

Index(['sentiment_Bullish', 'sentiment_None', 'sentiment_Bearish',
       'body_original', 'body_Processed'],
      dtype='object')
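
The cleaning steps from Data Prep 2, plus a plain stop-word filter, can be collected into small reusable functions. This is only a sketch: the names clean_body and drop_stop_words are hypothetical, the notebook's own DataPrep helper is defined elsewhere and may differ, and lemmatization is left out because it needs the NLTK WordNet download.

```python
import re

def clean_body(text):
    """Apply the four Data Prep 2 substitutions in the same order as the notebook."""
    # 1. Replace @mentions, URLs and anything that is not a letter, digit, space or tab
    text = " ".join(
        re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", text).split()
    )
    # 2. Lower case
    text = text.lower()
    # 3. Remove punctuation
    text = re.sub(r"[^\w\s]", "", text)
    # 4. Replace non-ASCII characters
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    return text

def drop_stop_words(text, stop_words):
    """Keep only the whitespace-separated tokens that are not in stop_words."""
    return " ".join(w for w in text.split() if w not in stop_words)

cleaned = clean_body("$INTC to the moon! https://t.co/abc @trader1")
filtered = drop_stop_words(cleaned, {"intc", "to", "the"})
print(cleaned)
print(filtered)
```

After the first substitution only letters, digits and single spaces remain, so the punctuation and unicode passes leave the text unchanged on inputs like these.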


Stage 4: Hyper-parameter tuning of models that used TF-BoW embedding data

In [None]:
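from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier #used by the grid searches below; their earlier imports are commented out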

Grid-Search hyperparameter tuning on AdaBoost Classifier

In [None]:
Y = df_tf_m['sentiment_Bullish']
X = df_tf_m.drop('sentiment_Bullish', axis = 1)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size = 0.85, random_state = 21)
print("Train Data Dimensions : ", X_train.shape)
print("Test Data Dimensions : ", X_test.shape)

In [None]:
#Creating a grid of hyperparameters
grid_params = {'n_estimators' : [100,200,300],
               'learning_rate' : [1.0, 0.1, 0.05]}

ABC = AdaBoostClassifier()
#Building a 10 fold CV GridSearchCV object
grid_object = GridSearchCV(estimator = ABC, param_grid = grid_params, scoring = 'roc_auc', cv = 10, n_jobs = -1)

#Fitting the grid to the training data
%time grid_object.fit(X_train, Y_train)

In [None]:
#Extracting the best parameters and score
print("Best Parameters : ", grid_object.best_params_)
print("Best_ROC-AUC : ", round(grid_object.best_score_ * 100, 2))
print("Best model : ", grid_object.best_estimator_)

#Applying the tuned parameters back to the model
Y_pred = grid_object.best_estimator_.predict(X_test)
probs = grid_object.best_estimator_.predict_proba(X_test)
clf_report(Y_test, Y_pred, probs)

kfold = KFold(n_splits=10, random_state=25, shuffle=True)
%time results = cross_val_score(grid_object.best_estimator_, X_test, Y_test, cv=kfold)
results = results * 100
results = np.round(results,2)
print("Cross Validation Accuracy : ", round(results.mean(), 2))
print("Cross Validation Accuracy in every fold : ", results)
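
The grid-search pattern used in these cells can be tried on synthetic data before running it on the full embedding. This is a minimal sketch: the data comes from make_classification, and the grid and fold count are deliberately smaller than the ones above so it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the BoW features
X, y = make_classification(n_samples=200, n_features=10, random_state=21)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.85, random_state=21
)

# Small grid and 3 folds (assumed values, chosen only for speed)
grid_params = {"n_estimators": [10, 20], "learning_rate": [1.0, 0.1]}
grid = GridSearchCV(
    estimator=AdaBoostClassifier(random_state=21),
    param_grid=grid_params,
    scoring="roc_auc",
    cv=3,
)
grid.fit(X_train, y_train)

best = grid.best_params_
# Probability of the positive class for each held-out row
test_probs = grid.best_estimator_.predict_proba(X_test)[:, 1]
print(best, round(grid.best_score_, 3))
```

best_score_ is the mean cross-validated ROC-AUC on the training split, so it is not a held-out test score.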

In [None]:
grid_params = {'n_estimators' : [100,200,300,400,500],
               'max_depth' : [10, 7, 5, 3],
               'criterion' : ['entropy', 'gini']}

RFC = RandomForestClassifier()
grid_object = GridSearchCV(estimator = RFC, param_grid = grid_params, scoring = 'roc_auc', cv = 10, n_jobs = -1)

%time grid_object.fit(X_train, Y_train)

In [None]:
# print("Best Parameters : ", grid_object.best_params_)
print("Best_ROC-AUC : ", round(grid_object.best_score_ * 100, 2))
print("Best model : ", grid_object.best_estimator_)

Y_pred = grid_object.best_estimator_.predict(X_test)
probs = grid_object.best_estimator_.predict_proba(X_test)
clf_report(Y_test, Y_pred, probs)

kfold = KFold(n_splits=10, random_state=25, shuffle=True)
%time results = cross_val_score(grid_object.best_estimator_, X_test, Y_test, cv=kfold)
results = results * 100
results = np.round(results,2)
print("Cross Validation Accuracy : ", round(results.mean(), 2))
print("Cross Validation Accuracy in every fold : ", results)
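
clf_report is called above but defined elsewhere in the notebook. A hypothetical stand-in, report_binary, shows the kind of summary that can be built from the three arguments it receives (true labels, predicted labels, and predict_proba output). The metrics chosen here are an assumption, not the original helper's contents.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

def report_binary(y_true, y_pred, probs):
    """Hypothetical stand-in for clf_report: returns a few binary metrics."""
    probs = np.asarray(probs)
    # predict_proba returns one column per class; column 1 is the positive class
    pos = probs[:, 1] if probs.ndim == 2 else probs
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "confusion_matrix": confusion_matrix(y_true, y_pred),
        "roc_auc": roc_auc_score(y_true, pos),
    }

# Toy labels and class probabilities (made up)
y_true = [0, 0, 1, 1]
probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.35, 0.65], [0.2, 0.8]])
y_pred = probs.argmax(axis=1)

summary = report_binary(y_true, y_pred, probs)
print(summary)
```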

###### APPROACH 2: PyCaret

In [None]:
df1 = pd.read_csv('tech stockTwit 03112021-Copy1.csv')

In [None]:
df1 = df1.drop_duplicates() # removes duplicate records
len(df1)

column = 'symbol'
df1.drop(df1[df1['symbol'] == column].index, inplace=True) #removes duplicate headers

df1 = df1.reset_index(drop = True) # resets the index

df1.drop(['symbol', 'created_at', 'followers'], inplace = True, axis = 1) #deletes columns
len(df1)

df1.rename(columns = {'body' : 'body_original'}, inplace = True) #renames body as body_original
display(df1.head())
df1 = pd.get_dummies(df1, columns = ['sentiment'], drop_first = False) # drop_first is false to get all three possibilities (column sentiments)

df2 = df1

display(df1.head())
df2 = df2[['sentiment_Bullish', 'sentiment_None' , 'sentiment_Bearish', 'body_original']] #reorders the columns
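
The one-hot step in this cell can be checked on a toy frame. This sketch assumes untagged messages should end up in a sentiment_None column, so missing values are filled with the string 'None' first. Without that, get_dummies ignores missing values and the column would not be created.

```python
import pandas as pd

# Toy messages (made up); the middle one has no sentiment tag
toy = pd.DataFrame({
    "body_original": ["$AMD looks strong", "$INTC earnings today", "$MU breaking down"],
    "sentiment": ["Bullish", None, "Bearish"],
})

# Assumption: untagged rows are labelled with the string 'None'
toy["sentiment"] = toy["sentiment"].fillna("None")

# drop_first=False keeps all three sentiment columns
toy = pd.get_dummies(toy, columns=["sentiment"], drop_first=False)

# Reorder the columns the same way as the notebook
toy = toy[["sentiment_Bullish", "sentiment_None", "sentiment_Bearish", "body_original"]]
print(toy)
```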


In [None]:
from nltk.corpus import stopwords

In [None]:
stop_words = stopwords.words('english')

In [None]:
#pip install --user https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz

In [None]:
from pycaret.nlp import *

In [None]:
%time su_1 = setup(data = df1, target = 'body_original', custom_stopwords = stop_words, session_id = 21)


Stage 2: Embedding on the processed text data

In [None]:
%time m1 = create_model(model='lda', multi_core=True)

In [None]:
%time lda_data = assign_model(m1)

In [None]:
lda_data.head()

In [None]:
evaluate_model(m1)

In [None]:
%time m2 = create_model(model='nmf', multi_core=True)

In [None]:
%time nmf_data = assign_model(m2)

In [None]:
lda_data.head()

In [None]:
lda_data.columns

In [None]:
lda_data.drop(['body_original', 'Dominant_Topic', 'Perc_Dominant_Topic'], axis=1, inplace = True)
lda_data.head()

In [None]:
nmf_data.head()

In [None]:
nmf_data.drop(['body_original', 'Dominant_Topic', 'Perc_Dominant_Topic'], axis=1, inplace = True)
nmf_data.head()
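
To see what per-document topic weights look like without PyCaret, the same idea can be sketched with scikit-learn's LatentDirichletAllocation. This is a swapped-in technique, not PyCaret's own pipeline, and the documents, topic count, and column names are illustrative.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (made up)
docs = [
    "chip demand strong earnings beat",
    "earnings beat guidance raised",
    "option volume unusual weekly call",
    "unusual option activity weekly",
]

# Bag-of-words counts feed the topic model
counts = CountVectorizer().fit_transform(docs)

# Two topics, fixed seed for repeatability
lda = LatentDirichletAllocation(n_components=2, random_state=21)
topic_weights = lda.fit_transform(counts)

# One row per document, one weight column per topic
topic_df = pd.DataFrame(topic_weights, columns=["Topic_0", "Topic_1"])
print(topic_df.round(3))
```

Each row is a distribution over topics, so these columns can be used as numeric features for a classifier in the same way lda_data is used in the next stage.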

Stage 3: Model Building

In [None]:
from pycaret.classification import *

In [None]:
%time pce_1 = setup(data = lda_data, target = 'sentiment_None', session_id = 5, train_size = 0.85)

In [None]:
%time compare_models()

Stage 4: Hyper-parameters tuning

In [None]:
#step1 : model creation
#%time pce_1_m1 = create_model('rf') #original code does not define 'rf' needed for the next step
%time rf = create_model('rf')

In [None]:
#step2 : model tuning
#%time tuned_pce_1_m1 = tune_model('rf') #older PyCaret releases took the model name as a string; newer ones expect the model object
%time tuned_pce_1_m1 = tune_model(rf)

In [None]:
#step3 : getting insights from model performance
%time evaluate_model(tuned_pce_1_m1)

In [None]:
%time pce_2 = setup(data = nmf_data, target = 'sentiment_Bullish', session_id = 5, train_size = 0.85)

In [None]:
%time compare_models()

In [None]:
#step2 : model tuning
%time tuned_pce_2_m1 = tune_model(tuned_pce_1_m1, optimize='AUC')

In [None]:
#step3 : getting insights from model performance
%time evaluate_model(tuned_pce_2_m1)

In [None]:
#function to get 'top N' or 'bottom N' words
#from PyCaret example

def get_n_words(corpus, direction, n):
    vec = CountVectorizer(stop_words = 'english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    if direction == "top":
        words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    else:
        words_freq = sorted(words_freq, key=lambda x: x[1], reverse=False)
    return words_freq[:n]

#15 most common and 15 most rare words
common_words = get_n_words(df2['body_Processed'], "top", 15)
rare_words = get_n_words(df2['body_Processed'], "bottom", 15)

common_words = dict(common_words)
names = list(common_words.keys())
values = list(common_words.values())
plt.subplots(figsize = (15,10))
bars = plt.bar(range(len(common_words)),values,tick_label=names)
plt.title('15 most common words:')
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x(), yval + .01, yval)
plt.show()

rare_words = dict(rare_words)
names = list(rare_words.keys())
values = list(rare_words.values())
plt.subplots(figsize = (15,10))
bars = plt.bar(range(len(rare_words)),values,tick_label=names)
plt.title('15 most rare words:')
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x(), yval + .001, yval)
plt.show()
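
The get_n_words helper can be sanity-checked on a tiny corpus before it is pointed at df2['body_Processed']. The function is repeated here so the sketch runs on its own, and the three toy messages are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer

def get_n_words(corpus, direction, n):
    # Count every word in the corpus, ignoring scikit-learn's English stop words
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, int(sum_words[0, idx])) for word, idx in vec.vocabulary_.items()]
    # "top" sorts from most to least frequent; anything else sorts the other way
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=(direction == "top"))
    return words_freq[:n]

toy_corpus = ["stock stock chip", "stock chip earnings", "bullish"]
top_two = get_n_words(toy_corpus, "top", 2)
bottom_two = get_n_words(toy_corpus, "bottom", 2)
print(top_two)
print(bottom_two)
```

Words with equal counts keep the order in which the vectorizer first saw them, so the "bottom" list is not unique when many words occur only once.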