# Part 9 -- Predictive Modeling

Prepare data for **Singular Value Decomposition (SVD)**.

### Load `lib` code:

In [1]:
!pwd

/home/jovyan/work/Portfolio/Analyzing_Unstructured_Data_for_Finance/ipynb


In [2]:
from os import chdir
chdir('/home/jovyan/work/Portfolio/predicting_stock_market_trends_with_Twitter/')

from lib import *
from lib.twitter_keys import my_keys
# suppress_warnings()

# DO IT ON RAW DATA --> TFIDF TO LOGISTIC REGRESSION --> GRIDSEARCH TUNE LOG REG & TRY RF, SVC, KNN.

* THEN do a GridSearch on the best one
* THEN CIRCLE BACK AND EDA ON CLUSTERS

## Logistic Regression

In [20]:
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [10]:
X = joblib.load('../Analyzing_Unstructured_Data_for_Finance/data/4.X.pickle')

In [30]:
y = joblib.load('../Analyzing_Unstructured_Data_for_Finance/data/5.y_le.pickle')

In [26]:
X = X['cleaned_text']

In [27]:
X.shape

(77258,)

In [31]:
y.shape

(77258,)

In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 42)

**Regularization: The problem of overfitting**

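You don't want your hypothesis to have high bias (underfit), but you also don't want it to take in so many features that it learns the training set really well and then fails to generalize to new data (for example, predicting prices on new data).

If you think overfitting is occurring, you can apply REGULARIZATION: keep all the features, but reduce the magnitude of their coefficients. This works well when you have LOTS of features that each contribute a little to the value of y, so you might not want to throw any of them away. L2 (ridge) regularization shrinks coefficients toward zero, while L1 (LASSO) can set some of them exactly to zero. In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so the `C=1E10` used in the pipelines below effectively turns regularization off.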
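
For reference, this is roughly how the two penalties enter the objective in scikit-learn's formulation of binary logistic regression. The notation is an assumption made here for illustration: labels $y_i \in \{-1, +1\}$, feature vectors $x_i$, weights $w$, intercept $c$. With the liblinear solver, a multi-class problem is fit as one such binary problem per class (one-vs-rest).

$$
\text{L2:}\quad \min_{w,\,c}\ \frac{1}{2}\, w^\top w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\,(x_i^\top w + c)}\right)
$$

$$
\text{L1:}\quad \min_{w,\,c}\ \lVert w \rVert_1 \;+\; C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\,(x_i^\top w + c)}\right)
$$

A larger `C` puts more weight on the data-fit term, so the penalty matters less. A smaller `C` shrinks the coefficients harder.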

In [33]:
tfidf_lr_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,3), min_df=20, stop_words='english')),
    ('lr', LogisticRegression(C=1E10))
])


In [34]:
tfidf_svd_lr_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,3), min_df=20, stop_words='english')),
    ('svd', TruncatedSVD(n_components = 10, random_state=42)),
    ('lr', LogisticRegression(C=1E10))
])


In [35]:
tfidf_lr_pipe.fit(X, y)

Pipeline(steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=20,
        ngram_range=(1, 3), norm='l2', preprocessor=None, smooth_idf=True,
...ty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False))])

In [40]:
tfidf_lr_pipe.score(X, y)

0.65022392503041759

In [72]:
params = {
    'penalty':['l2','l1'],
    'C':np.logspace(-3,3,30)
}

In [73]:
lr = LogisticRegression(solver='liblinear', random_state=42, n_jobs=-1)  # liblinear supports both 'l1' and 'l2'
gs_lr = GridSearchCV(lr, param_grid=params, n_jobs=-1, cv=StratifiedShuffleSplit(random_state=42))

In [79]:
start = datetime.now()

gs_lr.fit(X_train,y_train)

end = datetime.now()
print(end - start)

GridSearchCV(cv=StratifiedShuffleSplit(n_splits=10, random_state=42, test_size=0.1,
            train_size=None),
       error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=-1,
          penalty='l2', random_state=42, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'C': array([  1.00000e-03,   1.61026e-03,   2.59294e-03,   4.17532e-03,
         6.72336e-03,   1.08264e-02,   1.74333e-02,   2.80722e-02,
         4.52035e-02,   7.27895e-02,   1.17210e-01,   1.88739e-01,
         3.03920e-01,   4.89390e-01,   7.88046e-01,   1.26896e+00,
         2.0433...5e+02,   2.39503e+02,   3.85662e+02,
         6.21017e+02,   1.00000e+03]), 'penalty': ['l2', 'l1']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
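
The grid search above tunes a bare `LogisticRegression`, which presumably needs `X_train` to be vectorized already. One alternative, sketched here on an invented toy corpus, is to search over the whole TF-IDF pipeline so that raw text goes in and the vectorizer is refit on each training split. The corpus, the labels, the grid values and the name `gs_pipe` are all assumptions made for illustration, not the notebook's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration; 0 = down, 1 = up.
toy_texts = [
    'stocks fall on weak earnings', 'shares drop after downgrade',
    'market slides as yields rise', 'selloff deepens in tech',
    'index falls for third day', 'shares sink on guidance cut',
    'stocks rally on strong earnings', 'shares jump after upgrade',
    'market climbs as yields ease', 'rebound lifts tech',
    'index gains for third day', 'shares soar on guidance raise',
]
toy_labels = [0] * 6 + [1] * 6

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('lr', LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'lr__C': [0.1, 1.0, 10.0],
}

cv = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=42)
gs_pipe = GridSearchCV(pipe, param_grid=param_grid, cv=cv)

# Raw strings go in; TF-IDF is fit only on the training part of each split.
gs_pipe.fit(toy_texts, toy_labels)

print(gs_pipe.best_params_)
print(gs_pipe.predict(['shares jump on earnings']))
```

Searching over the pipeline rather than the classifier alone also lets vectorizer settings such as `ngram_range` or `min_df` be tuned in the same pass.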

In [80]:
gs_lr.best_estimator_

LogisticRegression(C=1.2689610031679222, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=-1, penalty='l2', random_state=42,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [None]:
gs_lr.best_estimator_.coef_

In [81]:
results = pd.DataFrame(gs_lr.cv_results_)
results[results['rank_test_score'] < 5].T

| | 30 | 31 | 32 | 34 |
|---|---|---|---|---|
| mean_fit_time | 0.45732 | 0.56292 | 0.512088 | 0.584101 |
| mean_score_time | 0.000864768 | 0.000838161 | 0.000855446 | 0.000875425 |
| mean_test_score | 0.898346 | 0.898346 | 0.898309 | 0.898309 |
| mean_train_score | 0.898444 | 0.898975 | 0.89893 | 0.899649 |
| param_C | 1.26896 | 1.26896 | 2.04336 | 3.29034 |
| param_penalty | l2 | l1 | l2 | l2 |
| params | {'C': 1.26896100317, 'penalty': 'l2'} | {'C': 1.26896100317, 'penalty': 'l1'} | {'C': 2.04335971786, 'penalty': 'l2'} | {'C': 3.29034456231, 'penalty': 'l2'} |
| rank_test_score | 1 | 1 | 3 | 3 |
| split0_test_score | 0.898364 | 0.897997 | 0.898364 | 0.898364 |
| split0_train_score | 0.898536 | 0.899128 | 0.899046 | 0.89972 |


In [82]:
# Note: X and y are the full data set, training rows included, so this is not a held-out score.
print('score:', gs_lr.best_estimator_.score(X, y))
print('predict_proba:', gs_lr.best_estimator_.predict_proba(X))

score: 0.898979501809
predict_proba: [[ 0.95657806  0.02225322  0.02116871]
 [ 0.96692168  0.02142833  0.01165   ]
 [ 0.86838372  0.05792451  0.07369177]
 ..., 
 [ 0.889696    0.08866285  0.02164115]
 [ 0.92811989  0.0269709   0.04490921]
 [ 0.95176279  0.02382861  0.0244086 ]]


In [83]:
from sklearn.metrics import classification_report, confusion_matrix

In [86]:
y_test_pred = gs_lr.predict(X_test)

In [87]:
confusion_matrix(y_test, y_test_pred)

array([[12247,     4,     0],
       [  962,     4,     0],
       [  385,     0,     0]])

In [None]:
# Look at f1-score 

In [88]:
print(classification_report(y_test, y_test_pred))

             precision    recall  f1-score   support

          0       0.90      1.00      0.95     12251
          1       0.50      0.00      0.01       966
          2       0.00      0.00      0.00       385

avg / total       0.85      0.90      0.85     13602
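
The confusion matrix above shows the model predicting class 0 for nearly every tweet, so the 0.90 accuracy is the same as the share of class 0 in the test set (12251 of 13602). A minimal sketch in plain Python that recomputes the per-class numbers from that matrix makes the gap visible. Macro-averaged F1, which weights each class equally, is added here as one possible summary and is not a metric the notebook itself reports.

```python
# Confusion matrix copied from the cell above:
# rows are true classes, columns are predicted classes.
cm = [
    [12247, 4, 0],
    [962, 4, 0],
    [385, 0, 0],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(n_classes)) / total

# Accuracy of always predicting the largest true class.
majority_baseline = max(sum(row) for row in cm) / total


def safe_div(num, den):
    """Return num / den, or 0.0 when den is zero (e.g. a class that is never predicted)."""
    return num / den if den else 0.0


recall = [safe_div(cm[k][k], sum(cm[k])) for k in range(n_classes)]
precision = [
    safe_div(cm[k][k], sum(cm[r][k] for r in range(n_classes)))
    for k in range(n_classes)
]
f1 = [safe_div(2 * p * r, p + r) for p, r in zip(precision, recall)]
macro_f1 = sum(f1) / n_classes  # unweighted mean over classes

print('accuracy          {:.4f}'.format(accuracy))
print('majority baseline {:.4f}'.format(majority_baseline))
print('recall per class ', ['{:.4f}'.format(r) for r in recall])
print('macro F1          {:.4f}'.format(macro_f1))
```

Because classes 1 and 2 are almost never predicted, the macro F1 falls far below the 0.85 weighted average in the report. That points toward class imbalance as the next thing to address.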





In [92]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC  # needed by the 'svc' entry in model_dict below

# RUN THIS OVERNIGHT

In [None]:
param_dict = {
    'mnb':{
        'alpha':np.linspace(.1,1,5)
    },
    'lr':{
        'C':np.logspace(-3,3,7)
    },
    'rf':{
        'n_estimators': np.arange(5,15),
        'max_depth': np.arange(1,5)
    },
    'svc':{
        'C': np.logspace(-3,3,7)
    }
}

In [None]:
model_dict = {
    'mnb':GridSearchCV(MultinomialNB(),
                             param_grid=param_dict['mnb'],
                             cv=StratifiedShuffleSplit(n_splits=5, random_state=42)),
    'lr':GridSearchCV(LogisticRegression(random_state=42, n_jobs=-1),
                             param_grid=param_dict['lr'],
                             cv=StratifiedShuffleSplit(n_splits=5, random_state=42)),
    'rf':GridSearchCV(RandomForestClassifier(random_state=42, n_jobs=-1),
                      param_grid=param_dict['rf'],
                      cv=StratifiedShuffleSplit(n_splits=5, random_state=42)),
    'svc':GridSearchCV(SVC(random_state=42),
                      param_grid=param_dict['svc'],
                      cv=StratifiedShuffleSplit(n_splits=5, random_state=42)),
}

In [None]:
def fit_all_models(X, y, model_dict):
    """Fit every grid search in model_dict and print its best cross-validated score."""
    for name, grid_search in model_dict.items():
        grid_search.fit(X, y)
        print("{:5} best score: {}".format(name, grid_search.best_score_))

In [None]:
start = datetime.now()

fit_all_models(X_train, y_train, model_dict)

end = datetime.now()
print(end - start)

In [93]:
param_dict = {
    'mnb':{
        'alpha':np.linspace(.1,1,5)
    },
    'lr':{
        'C':np.logspace(-3,3,7)
    },
    'rf':{}
}

In [94]:
model_dict = {
    'mnb':GridSearchCV(MultinomialNB(),
                             param_grid=param_dict['mnb'],
                             cv=StratifiedShuffleSplit(n_splits=5, random_state=42)),
    'lr':GridSearchCV(LogisticRegression(),
                             param_grid=param_dict['lr'],
                             cv=StratifiedShuffleSplit(n_splits=5, random_state=42)),
    'rf':GridSearchCV(RandomForestClassifier(),
                      param_grid=param_dict['rf'],
                      cv=StratifiedShuffleSplit(n_splits=5, random_state=42)),
}

In [101]:
def fit_all_models(X, y, model_dict):
    """Fit every grid search in model_dict and print its best cross-validated score."""
    for name, grid_search in model_dict.items():
        grid_search.fit(X, y)
        print("{:5} best score: {}".format(name, grid_search.best_score_))

In [102]:
fit_all_models(X_train, y_train, model_dict)

lr    best score: 0.8982539974269436
mnb   best score: 0.8983642712736629
rf    best score: 0.8863811799301599


In [None]:
tfidf = joblib.load('')

In [None]:
# Row masks must come from y_train so they line up with the rows of X_train.
down_df = pd.DataFrame(X_train.todense()[y_train==0], columns=tfidf.get_feature_names())
neutral_df = pd.DataFrame(X_train.todense()[y_train==1], columns=tfidf.get_feature_names())
up_df = pd.DataFrame(X_train.todense()[y_train==2], columns=tfidf.get_feature_names())

In [None]:
down_df.sum().sort_values(ascending=False)[:20]

In [None]:
neutral_df.sum().sort_values(ascending=False)[:20]

In [None]:
up_df.sum().sort_values(ascending=False)[:20]

In [None]:
Shift y's by 1 so it predicts TOMORROW's close
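
A minimal sketch of that shift, assuming one row per trading day in chronological order (the notebook's data is per tweet, so it would first have to be aggregated or joined by date). The helper name `align_with_next_day` and the toy values are hypothetical.

```python
def align_with_next_day(daily_features, daily_labels):
    """Pair day t's features with day t+1's label.

    The last day has no next-day label, so it is dropped.
    """
    if len(daily_features) != len(daily_labels):
        raise ValueError('features and labels must cover the same days')
    return daily_features[:-1], daily_labels[1:]


# Hypothetical daily data, already sorted by date.
days = ['mon', 'tue', 'wed', 'thu']
labels = [0, 2, 1, 0]  # the move observed on that same day

X_shifted, y_shifted = align_with_next_day(days, labels)
print(list(zip(X_shifted, y_shifted)))
```

With pandas, the same alignment is usually done with `Series.shift(-1)` on the label column, followed by dropping the final row.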

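USE MNB: it sounds cool, and it had the highest best score of the three models above (0.8984 vs. 0.8983 for LR, so only by a hair)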


**TRY XGBOOST**

# PULL NEW DATA FROM THESE 30 people, make new test set, use your encoder, ...9/10 tweets predicted stocks correctly... 

Look at the time of each tweet: did it go out before the close?
- tweets vs. what happened that day (up/down/etc.) vs. prediction (up/down)
- do this for every tweet
- put in a timestamp (0-24)
    - adjust it so everyone is on the same clock (one time zone)
    - chunk the data
    - see what the accuracy was in the morning vs. after the market has closed: does my model accuracy change?
        - intuition: if tweets went out after the market closed, is that why the scores are so good?
        - OR NOT
- CAN I LOOK AT THESE TWEETS BEFORE THE MARKET OPENS AND PREDICT WHAT HAPPENS?
- IDENTIFIED THE 30 PEOPLE TO LISTEN TO
    - USE LSA to find more people to listen to (who tweets similarly: influencers)
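
The timestamp idea in the list above could be sketched as follows. The assumptions are loud ones: tweet times are already converted to US Eastern time and expressed as a decimal hour in [0, 24), the regular NYSE session of 9:30 to 16:00 is used, and weekends and holidays are ignored. The function names and the toy rows are invented for illustration.

```python
from collections import defaultdict

# Assumption: regular NYSE hours, as decimal hours in Eastern time.
MARKET_OPEN, MARKET_CLOSE = 9.5, 16.0


def session_of(hour_et):
    """Label a decimal hour as before, during, or after the trading session."""
    if hour_et < MARKET_OPEN:
        return 'pre_open'
    if hour_et < MARKET_CLOSE:
        return 'open'
    return 'after_close'


def accuracy_by_session(records):
    """records: iterable of (hour_et, y_true, y_pred) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for hour_et, y_true, y_pred in records:
        session = session_of(hour_et)
        totals[session] += 1
        hits[session] += int(y_true == y_pred)
    return {session: hits[session] / totals[session] for session in totals}


# Made-up rows: (hour in Eastern time, true label, predicted label).
toy_records = [
    (7.25, 0, 0), (8.0, 2, 0),
    (11.5, 0, 0), (15.9, 1, 0),
    (18.0, 0, 0), (22.5, 0, 0),
]
session_accuracy = accuracy_by_session(toy_records)
print(session_accuracy)
```

If accuracy on real data turned out noticeably higher for the `after_close` bucket than for `pre_open`, that would support the intuition that the model is reading reactions to the day's move rather than predicting it.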
       
       
       
# ENSEMBLING
Build ensemble models by chunking the hours of the day to create new features (is the NY market open? China? Day of week?)
- can you chunk your input data (X and y grouped together) into a few SMARTLY chosen chunks and build a different model for each one?
- one model for: is it morning and NYC hasn't opened yet?
- one model for: market is open (morning), (evening), close
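
One way to read the per-chunk idea is a thin router that fits a separate model on each chunk and sends every new row to the model for its chunk. The sketch below is a hypothetical illustration in plain Python. `PerChunkModel`, `MajorityClass` and `toy_chunk_of` are invented names, and the majority-class stand-in only marks where a real estimator with `fit` and `predict` would go.

```python
from collections import Counter, defaultdict


class MajorityClass:
    """Stand-in model with a fit/predict interface; swap in a real estimator."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class PerChunkModel:
    """Fit one model per chunk and route each row to the model for its chunk."""

    def __init__(self, make_model, chunk_of):
        self.make_model = make_model  # zero-argument factory returning a fresh model
        self.chunk_of = chunk_of      # maps a row to its chunk key
        self.models_ = {}

    def fit(self, X, y):
        grouped = defaultdict(lambda: ([], []))
        for row, label in zip(X, y):
            rows, labels = grouped[self.chunk_of(row)]
            rows.append(row)
            labels.append(label)
        self.models_ = {
            key: self.make_model().fit(rows, labels)
            for key, (rows, labels) in grouped.items()
        }
        return self

    def predict(self, X):
        # Raises KeyError for a chunk that never appeared during fit.
        return [self.models_[self.chunk_of(row)].predict([row])[0] for row in X]


def toy_chunk_of(row):
    """Assumed chunking: decimal hour in Eastern time against a 9:30-16:00 session."""
    if row['hour_et'] < 9.5:
        return 'pre_open'
    return 'open' if row['hour_et'] < 16.0 else 'after_close'


# Made-up rows and labels, for illustration only.
train_rows = [{'hour_et': h} for h in (7.0, 8.0, 12.0, 13.0, 18.0)]
train_labels = [2, 2, 0, 0, 1]

router = PerChunkModel(MajorityClass, toy_chunk_of).fit(train_rows, train_labels)
predictions = router.predict([{'hour_et': 6.5}, {'hour_et': 10.0}, {'hour_et': 20.0}])
print(predictions)
```

Each chunk sees less training data than a single global model would, so it is worth comparing the routed scores against the global baseline before committing to this design.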

COMPLETELY SPLIT YOUR X's and Y's BEFORE doing anything to them. Is Twitter reactionary or causal? Do you get a higher/lower score? 
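
A minimal sketch of splitting first, assuming each row carries a sortable date. The rows are split by time, then the vocabulary is learned from the training rows only, so nothing from the test period leaks into the features. The field names, dates and helper functions are hypothetical.

```python
def chronological_split(rows, test_fraction=0.2):
    """Sort rows by date and hold out the most recent test_fraction as the test set."""
    ordered = sorted(rows, key=lambda row: row['date'])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


def build_vocabulary(train_rows):
    """Learn the vocabulary from training text only."""
    return sorted({token for row in train_rows for token in row['text'].split()})


def to_counts(row, vocabulary):
    """Bag-of-words counts over a fixed vocabulary; unseen test tokens are ignored."""
    tokens = row['text'].split()
    return [tokens.count(term) for term in vocabulary]


# Made-up rows; ISO dates sort correctly as plain strings.
toy_rows = [
    {'date': '2017-03-03', 'text': 'rates rise', 'label': 0},
    {'date': '2017-03-01', 'text': 'stocks rally', 'label': 2},
    {'date': '2017-03-05', 'text': 'bitcoin rally', 'label': 2},
    {'date': '2017-03-02', 'text': 'stocks fall', 'label': 0},
    {'date': '2017-03-04', 'text': 'rates fall', 'label': 0},
]

train_rows, test_rows = chronological_split(toy_rows)
vocabulary = build_vocabulary(train_rows)
X_train_counts = [to_counts(row, vocabulary) for row in train_rows]
X_test_counts = [to_counts(row, vocabulary) for row in test_rows]

print(vocabulary)
print(X_test_counts)
```

A time-ordered split like this also bears on the reactionary-or-causal question: the model is scored only on tweets that come after everything it was trained on, which a random shuffle does not guarantee.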


