# Part 9 -- Model Selection

Using pipelines to find the best predictive model for our data.

**Load lib codes**

In [1]:
!pwd

/home/jovyan/work/Portfolio/Analyzing_Unstructured_Data_for_Finance/ipynb


In [2]:
from os import chdir
chdir('/home/jovyan/work/Portfolio/Analyzing_Unstructured_Data_for_Finance/')

from lib import *
# suppress_warnings()

**Try a bunch of naive models to see which works best with our data**<br>
Next, we will GridSearch over our final model and tune its hyperparameters.

In [3]:
X = joblib.load('../Analyzing_Unstructured_Data_for_Finance/data/4.X.pickle')

In [4]:
y = joblib.load('../Analyzing_Unstructured_Data_for_Finance/data/5.y_le.pickle')

In [5]:
X = X['cleaned_text']

In [6]:
X_tfidf = joblib.load('../Analyzing_Unstructured_Data_for_Finance/data/8.X_tfidf.pickle')

In [9]:
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=.2, random_state=42)

In [11]:
X_train.shape

(61806, 6369)

In [12]:
y_train.shape

(61806,)

In [13]:
param_dict = {}

In [14]:
model_dict = {
    'mnb':GridSearchCV(MultinomialNB(),
                             param_grid=param_dict,
                             cv=StratifiedShuffleSplit(n_splits=5, random_state=42)),
    'lr':GridSearchCV(LogisticRegression(),
                             param_grid=param_dict,
                             cv=StratifiedShuffleSplit(n_splits=5, random_state=42)),
    'rf':GridSearchCV(RandomForestClassifier(),
                      param_grid=param_dict,
                      cv=StratifiedShuffleSplit(n_splits=5, random_state=42)),
    'svc':GridSearchCV(SVC(),
                      param_grid=param_dict,
                      cv=StratifiedShuffleSplit(n_splits=5, random_state=42)),
    'knn':GridSearchCV(KNeighborsClassifier(),
                      param_grid=param_dict,
                      cv=StratifiedShuffleSplit(n_splits=5, random_state=42)),
}

In [15]:
def fit_all_models(X,y, model_dict):
    for model in model_dict.keys():
        model_dict[model].fit(X,y)
        print("{:5} best score: {}".format(model, model_dict[model].best_score_))

In [None]:
start = datetime.now()

fit_all_models(X_train, y_train, model_dict)

end = datetime.now()
print(end - start)
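
Because `param_dict` is empty, each `GridSearchCV` above evaluates a single default-parameter candidate, so `best_score_` is simply the mean cross-validated score. A minimal sketch of the same comparison written with `cross_val_score` follows. It assumes synthetic data from `make_classification` as a stand-in for `X_train` / `y_train`, and it uses only two of the models so it runs quickly; `X_demo`, `y_demo` and `mean_scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class stand-in for X_train / y_train (assumption, not the tweet data).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=5,
                                     n_classes=3, random_state=42)

# Same kind of splitter as in model_dict, with an explicit test size.
cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)

models = {
    "lr": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
}

mean_scores = {}
for name, model in models.items():
    # One accuracy value per split; the mean is what best_score_ would report.
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    mean_scores[name] = scores.mean()
    print("{:5} mean CV accuracy: {:.3f}".format(name, mean_scores[name]))
```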

**GridSearch a pipeline**

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [None]:
tfidf_lr_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,3), min_df=20, stop_words='english')),
    ('lr', LogisticRegression(C=1E10))
])


In [None]:
tfidf_lr_pipe.fit(X, y)

In [None]:
tfidf_lr_pipe.score(X, y)
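
The cells above fit and score one fixed pipeline. To actually grid-search a pipeline, the parameter names take the form `<step name>__<parameter>`. A minimal sketch, assuming a tiny made-up corpus (`docs`, `labels`) in place of `X` and `y`, and an illustrative grid that is not tuned for the real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Toy corpus (assumption): 1 = up, 0 = down.
docs = [
    "stock rally gains today", "shares rally strong gains",
    "market gains rally continues", "strong rally lifts shares",
    "gains extend market rally", "rally pushes stock higher",
    "stock drop losses today", "shares drop heavy losses",
    "market losses drop continues", "heavy drop hits shares",
    "losses extend market drop", "drop pushes stock lower",
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("lr", LogisticRegression(max_iter=1000)),
])

# Keys are "<step name>__<parameter>"; the values here are illustrative only.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "lr__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    pipe,
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(docs, labels)

print(search.best_params_)
print(search.best_score_)
```

Because the vectorizer sits inside the pipeline, it is refit on the training part of every fold, so the cross-validated score does not see the held-out text during feature extraction.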

**Logistic Regression on tfidf_svd data**

In [None]:
tfidf_svd_lr_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,3), min_df=20, stop_words='english')),
    ('svd', TruncatedSVD(n_components = 10, random_state=42)),
    ('lr', LogisticRegression(C=1E10))
])


In [None]:
tfidf_svd_lr_pipe.fit(X, y)

In [None]:
tfidf_svd_lr_pipe.score(X, y)

**Regularization: The problem of overfitting**

You don't want your hypothesis to have high bias (underfit). You also don't want it to use so many features that the learned hypothesis fits the training set really well but does not generalize to new data (for example, when predicting prices on new data).

If you think overfitting is occurring, you can use regularization: keep all the features, but reduce the magnitude of their coefficients. This method works well when you have lots of features that each contribute a little bit to the value of y, so you might not want to throw them away. LASSO (an L1 penalty) is one form of regularization.
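
In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so the `C=1E10` used in the pipelines above leaves the model essentially unregularized. A minimal sketch of the effect, assuming synthetic data from `make_classification` rather than the tweet features, and an illustrative `C=0.1`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 50 features, only 5 of them informative.
X_demo, y_demo = make_classification(n_samples=300, n_features=50,
                                     n_informative=5, random_state=42)

# Huge C: almost no penalty on the coefficients.
unregularized = LogisticRegression(C=1e10, max_iter=5000).fit(X_demo, y_demo)
# Default L2 penalty with a small C: coefficients are shrunk toward zero.
ridge = LogisticRegression(C=0.1, max_iter=5000).fit(X_demo, y_demo)
# L1 (LASSO-style) penalty: many coefficients become exactly zero.
lasso = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X_demo, y_demo)

for name, model in [("C=1e10", unregularized), ("L2, C=0.1", ridge), ("L1, C=0.1", lasso)]:
    coef_norm = np.linalg.norm(model.coef_)
    n_zero = int((model.coef_ == 0).sum())
    print("{:10} coefficient norm: {:8.3f}   zero coefficients: {}".format(name, coef_norm, n_zero))
```

The L2 fit keeps every feature with smaller weights, which matches the "keep all features, but reduce the magnitude" idea; the L1 fit additionally drops features by zeroing their weights.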

In [None]:
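# Mask with y_train (not y) so the boolean index matches the rows of X_train.
# `tfidf` is assumed to be the fitted TfidfVectorizer that produced X_tfidf.
down_df = pd.DataFrame(X_train.todense()[y_train==0], columns=tfidf.get_feature_names())
neutral_df = pd.DataFrame(X_train.todense()[y_train==1], columns=tfidf.get_feature_names())
up_df = pd.DataFrame(X_train.todense()[y_train==2], columns=tfidf.get_feature_names())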

In [None]:
down_df.sum().sort_values(ascending=False)[:20]

In [None]:
neutral_df.sum().sort_values(ascending=False)[:20]

In [None]:
up_df.sum().sort_values(ascending=False)[:20]

In [None]:
# TODO: shift y by 1 so the model predicts TOMORROW's close.
# TODO: use MNB (MultinomialNB).
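
The "shift y by 1" note can be sketched with `pandas.Series.shift`. This is a minimal illustration that assumes a hypothetical daily price table with a `close` column and a binary up/down target; the notebook's real labels have three classes (down, neutral, up) and may be built differently.

```python
import pandas as pd

# Hypothetical daily closes (made-up numbers, for illustration only).
prices = pd.DataFrame(
    {"close": [100.0, 101.5, 101.0, 103.0, 102.0]},
    index=pd.to_datetime(["2018-03-05", "2018-03-06", "2018-03-07",
                          "2018-03-08", "2018-03-09"]),
)

# shift(-1) pulls the NEXT row's close onto today's row.
prices["next_close"] = prices["close"].shift(-1)

# The last day has no "tomorrow", so it cannot be labeled.
prices = prices.dropna(subset=["next_close"])

# Target: 1 if tomorrow closes above today, else 0.
prices["target"] = (prices["next_close"] > prices["close"]).astype(int)

print(prices)
```

Features built from a given day's tweets would then be paired with that day's `target`, so the model is asked about a close it has not yet seen.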


**Try XGBoost**

# Pull new data from these 30 people, make a new test set, use your encoder, ...9/10 tweets predicted stocks correctly...

Look at each tweet's time: did it go out before the close?
- tweet vs. what happened that day (up/down/etc.) vs. prediction (up/down)
- do this for every tweet
- put in a timestamp (0-24)
    - adjust it so everyone is on the same clock (one time zone)
    - chunk the data
    - see what the accuracy was in the morning vs. after the market has closed: does my model accuracy change?
        - intuition: if the tweets came after the market closed, is that why the scores are so good?
        - or not

Can I look at these tweets before the market opens and predict what happens?

Identified the 30 people to listen to
- use LSA to find more people to listen to (who tweets similarly: influencers)
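
The timestamp idea in the notes above can be sketched as follows: put every tweet on New York time, bucket it by market session, and compare accuracy per bucket. This is a minimal sketch under assumptions: the column names `created_at`, `y_true` and `y_pred` are hypothetical, the rows are made up, and the sessions use the regular 9:30 to 16:00 Eastern trading hours.

```python
import pandas as pd

# Made-up tweets with UTC timestamps, true labels and model predictions.
tweets = pd.DataFrame({
    "created_at": pd.to_datetime(
        ["2018-03-05 13:00", "2018-03-05 15:30",
         "2018-03-05 20:00", "2018-03-05 22:30"], utc=True),
    "y_true": [2, 0, 1, 2],
    "y_pred": [2, 0, 2, 2],
})

# Put everyone on the same clock: New York time.
ny_time = tweets["created_at"].dt.tz_convert("America/New_York")
minutes = ny_time.dt.hour * 60 + ny_time.dt.minute

# Bucket by minutes since midnight: before 9:30, 9:30 to 15:59, 16:00 onward.
tweets["session"] = pd.cut(
    minutes,
    bins=[-1, 9 * 60 + 29, 15 * 60 + 59, 24 * 60],
    labels=["pre_market", "market_open", "after_close"],
).astype(str)

# Accuracy per session: fraction of tweets where the prediction was right.
tweets["correct"] = tweets["y_true"] == tweets["y_pred"]
accuracy_by_session = tweets.groupby("session")["correct"].mean()
print(accuracy_by_session)
```

If accuracy is clearly higher for `after_close` tweets than for `pre_market` ones, that would support the suspicion that the good scores come from tweets reacting to the day's move.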
       
       
       
# Ensembling
Building ensemble models based on chunking hours of the day to create new features (is the NY market open? China? Day of week?)
- can you chunk your input data (X and y grouped together) into a couple of different SMARTLY chosen chunks and build a different model for each one?
- one model for: it is morning and NYC hasn't opened yet
- one model for: market is open (morning), (evening), close

Completely split your X's and y's BEFORE doing anything to them. Is Twitter reactionary or causal? Do you get a higher/lower score?
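
One way to read "split before doing anything" is to split the raw text first and fit the vectorizer on the training part only. The pipelines earlier in this notebook call `.score(X, y)` on the same data they were fit on, which reports training accuracy. A minimal sketch, assuming a tiny made-up corpus (`docs`, `labels`) in place of the real tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Toy corpus (assumption): 1 = up, 0 = down.
docs = [
    "stock rally gains today", "shares rally strong gains",
    "market gains rally continues", "strong rally lifts shares",
    "gains extend market rally", "rally pushes stock higher",
    "stock drop losses today", "shares drop heavy losses",
    "market losses drop continues", "heavy drop hits shares",
    "losses extend market drop", "drop pushes stock lower",
]
labels = [1] * 6 + [0] * 6

# Split the RAW text first, before any vectorizing.
docs_train, docs_test, y_train_demo, y_test_demo = train_test_split(
    docs, labels, test_size=0.25, stratify=labels, random_state=42)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("lr", LogisticRegression(max_iter=1000)),
])

# The vocabulary and IDF weights are learned from the training text only.
pipe.fit(docs_train, y_train_demo)

train_score = pipe.score(docs_train, y_train_demo)
test_score = pipe.score(docs_test, y_test_demo)
print("train accuracy:", train_score)
print("held-out accuracy:", test_score)
```

Comparing the two numbers on the real data would show how much of the score survives on text the pipeline has never seen.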


