# Modeling & Final Model Selection

---


![image](../Files/model-selection.png)

As shown, the models trained on the original dataset perform poorly because the classes are unbalanced. We considered three remedies: changing the class ratio by **increasing the minority class**, changing it by **decreasing the majority class**, and using ensemble models. Because the `increasing target == 1` method only duplicates existing minority rows instead of adding new information, we decided to go with the `decreasing target == 0` method.

As a result, our pick for the final model is **Model 8**, a Bagging Classifier built on decreased data.

In [1]:
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.metrics import roc_auc_score, recall_score, f1_score

In [2]:
df = pd.read_csv("../Data/twitter_target.csv")

In [3]:
df.shape

(18990, 6)

**The dataset is very unbalanced (about 95% class 0 vs. 5% class 1), which could be a problem:**

In [4]:
df['target'].value_counts()

0    18048
1      942
Name: target, dtype: int64

In [5]:
df['target'].value_counts(normalize = True)

0    0.950395
1    0.049605
Name: target, dtype: float64

In [6]:
# Assign X and y
X = df[['tweet']]
y = df['target']

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 19, stratify = y)

### Logistic Regression Model

In [7]:
pipe_lgr = Pipeline([
    ('cvec', CountVectorizer(stop_words = 'english')),
    ('lgr', LogisticRegression(solver = 'liblinear'))  # liblinear supports both 'l1' and 'l2'
])

params_pipe_lgr = {
    'cvec__min_df': [2, 3],
    'cvec__ngram_range': [(1,1), (1,2)],
    'lgr__C': [1.0, 2.0],
    'lgr__penalty': ['l1', 'l2']
}

gs_pipe_lgr = GridSearchCV(pipe_lgr, params_pipe_lgr, cv = 5)
gs_pipe_lgr.fit(X_train['tweet'], y_train)







GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                            

In [8]:
print(f'Best params: {gs_pipe_lgr.best_params_}')
print("------------------------------------")
print(f'Training score: {gs_pipe_lgr.score(X_train["tweet"], y_train)}')
print(f'Testing score: {gs_pipe_lgr.score(X_test["tweet"], y_test)}')
print(f'ROC AUC score: {roc_auc_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')
print(f'Recall score: {recall_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')
print(f'F1 score: {f1_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')

Best params: {'cvec__min_df': 2, 'cvec__ngram_range': (1, 2), 'lgr__C': 1.0, 'lgr__penalty': 'l1'}
------------------------------------
Training score: 0.9620137621120629
Testing score: 0.952190395956192
ROC AUC score: 0.5471773951196057
Recall score: 0.09745762711864407
F1 score: 0.1684981684981685
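
The scores above can be tied together with a little arithmetic. The notebook does not print a confusion matrix, so the counts below are inferred: they are the counts that reproduce the printed accuracy, recall, F1 and ROC AUC, assuming the default 25% stratified test split. Treat them as a reconstruction, not as logged output.

```python
# Confusion counts inferred from the printed scores (an assumption, see above)
tp, fn = 23, 213     # 236 actual positives in the test split
tn, fp = 4498, 14    # 4512 actual negatives in the test split

accuracy = (tp + tn) / (tp + fn + tn + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)
f1 = 2 * tp / (2 * tp + fp + fn)

# With hard 0/1 predictions the ROC curve has a single operating point,
# so ROC AUC reduces to the mean of recall and specificity
auc_from_labels = (recall + specificity) / 2

print(f'Accuracy: {accuracy}')
print(f'Recall: {recall}')
print(f'F1: {f1}')
print(f'ROC AUC from labels: {auc_from_labels}')
```

Read this way, the high testing score comes almost entirely from the negatives, while fewer than one in ten positives is found.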


### Bagging Classifier

In [9]:
pipe_bag = Pipeline([
    ('cvec', CountVectorizer(stop_words = 'english')),
    ('bag', BaggingClassifier())
])

params_pipe_bag = {
    'cvec__min_df': [2, 3],
    'cvec__ngram_range': [(1,1), (1,2)],
    'bag__n_estimators': [8, 10]
}

gs_pipe_bag = GridSearchCV(pipe_bag, params_pipe_bag, cv = 5)
gs_pipe_bag.fit(X_train['tweet'], y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                            

In [10]:
print(f'Best params: {gs_pipe_bag.best_params_}')
print("------------------------------------")
print(f'Training score: {gs_pipe_bag.score(X_train["tweet"], y_train)}')
print(f'Testing score: {gs_pipe_bag.score(X_test["tweet"], y_test)}')
print(f'ROC AUC score: {roc_auc_score(y_test, gs_pipe_bag.predict(X_test["tweet"]))}')
print(f'Recall score: {recall_score(y_test, gs_pipe_bag.predict(X_test["tweet"]))}')
print(f'F1 score: {f1_score(y_test, gs_pipe_bag.predict(X_test["tweet"]))}')

Best params: {'bag__n_estimators': 10, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2)}
------------------------------------
Training score: 0.9863783176520151
Testing score: 0.9448188711036226
ROC AUC score: 0.6216041591537445
Recall score: 0.2627118644067797
F1 score: 0.3212435233160622
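
Note that the ROC AUC values in this notebook are computed from `predict()` output, i.e. from hard 0/1 labels. `roc_auc_score` also accepts continuous scores, and the two inputs can give different values. The toy arrays below are made up for illustration and are not taken from the tweet data:

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities for class 1 (illustration only)
y_true = [0, 0, 0, 0, 1, 1]
proba = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]

# Hard labels at the usual 0.5 threshold, as predict() would return
hard = [int(p >= 0.5) for p in proba]

auc_from_proba = roc_auc_score(y_true, proba)
auc_from_labels = roc_auc_score(y_true, hard)

print(f'ROC AUC from probabilities: {auc_from_proba}')
print(f'ROC AUC from hard labels: {auc_from_labels}')
```

For the pipelines here, the analogous score would presumably come from something like `gs_pipe_bag.predict_proba(X_test["tweet"])[:, 1]`; we have not re-run the notebook with that change, so the printed values above are left as they are.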


### Random Forest Classifier

In [11]:
pipe_rf = Pipeline([
    ('cvec', CountVectorizer(stop_words = 'english')),
    ('rf', RandomForestClassifier(random_state= 19))
])

params_pipe_rf = {
    'cvec__min_df': [2, 3],
    'cvec__ngram_range': [(1,1), (1,2)],
    'rf__max_depth': [5, 10]
}

gs_pipe_rf = GridSearchCV(pipe_rf, params_pipe_rf, cv = 5)
gs_pipe_rf.fit(X_train['tweet'], y_train)





GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                            

In [12]:
print(f'Best params: {gs_pipe_rf.best_params_}')
print("------------------------------------")
print(f'Training score: {gs_pipe_rf.score(X_train["tweet"], y_train)}')
print(f'Testing score: {gs_pipe_rf.score(X_test["tweet"], y_test)}')
print(f'ROC AUC score: {roc_auc_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')
print(f'Recall score: {recall_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')
print(f'F1 score: {f1_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')

Best params: {'cvec__min_df': 2, 'cvec__ngram_range': (1, 1), 'rf__max_depth': 5}
------------------------------------
Training score: 0.9504283106305295
Testing score: 0.9502948609941028
ROC AUC score: 0.5
Recall score: 0.0
F1 score: 0.0
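
A recall of 0.0 with a testing score of about 0.95 means the selected forest predicts class 0 for every test tweet. One contributing factor may be that `GridSearchCV` ranks parameter combinations by the estimator's default score (accuracy for classifiers) unless `scoring` is set. The sketch below shows the `scoring` argument on a tiny made-up corpus; the texts, labels and grid are placeholders, not the notebook's data or its actual configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up, unbalanced toy corpus: 16 negatives, 4 positives
texts = (["traffic is fine today", "nice weather this morning"] * 8
         + ["road closed after crash", "power outage downtown"] * 2)
labels = [0] * 16 + [1] * 4

pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lgr', LogisticRegression(solver='liblinear')),
])

# scoring='f1' makes the search rank candidates by F1 on class 1, not accuracy
gs = GridSearchCV(pipe, {'lgr__C': [1.0, 2.0]}, cv=2, scoring='f1')
gs.fit(texts, labels)

print(f'Best params: {gs.best_params_}')
print(f'Best cross-validated F1: {gs.best_score_}')
```

Whether this would change the selected parameters on the tweet data is untested here.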




## TFIDF

Let's try `TfidfVectorizer` instead of `CountVectorizer` to see if our models perform better:

### Logistic Regression Model

In [13]:
pipe_lgr = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words = 'english')),
    ('lgr', LogisticRegression(solver = 'liblinear'))  # liblinear supports both 'l1' and 'l2'
])

params_pipe_lgr = {
    'tfidf__min_df': [2, 3],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'lgr__C': [1.0, 2.0],
    'lgr__penalty': ['l1', 'l2']
}

gs_pipe_lgr = GridSearchCV(pipe_lgr, params_pipe_lgr, cv = 5)
gs_pipe_lgr.fit(X_train['tweet'], y_train)







GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                         

In [14]:
print(f'Best params: {gs_pipe_lgr.best_params_}')
print("------------------------------------")
print(f'Training score: {gs_pipe_lgr.score(X_train["tweet"], y_train)}')
print(f'Testing score: {gs_pipe_lgr.score(X_test["tweet"], y_test)}')
print(f'ROC AUC score: {roc_auc_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')
print(f'Recall score: {recall_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')
print(f'F1 score: {f1_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')

Best params: {'lgr__C': 2.0, 'lgr__penalty': 'l1', 'tfidf__min_df': 3, 'tfidf__ngram_range': (1, 1)}
------------------------------------
Training score: 0.9587136638112624
Testing score: 0.9519797809604044
ROC AUC score: 0.5450587510518091
Recall score: 0.09322033898305085
F1 score: 0.16176470588235295


### Bagging Classifier

In [15]:
pipe_bag = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words = 'english')),
    ('bag', BaggingClassifier())
])

params_pipe_bag = {
    'tfidf__min_df': [2, 3],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'bag__n_estimators': [8, 10]
}

gs_pipe_bag = GridSearchCV(pipe_bag, params_pipe_bag, cv = 5)
gs_pipe_bag.fit(X_train['tweet'], y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                         

In [16]:
print(f'Best params: {gs_pipe_bag.best_params_}')
print("------------------------------------")
print(f'Training score: {gs_pipe_bag.score(X_train["tweet"], y_train)}')
print(f'Testing score: {gs_pipe_bag.score(X_test["tweet"], y_test)}')
print(f'ROC AUC score: {roc_auc_score(y_test, gs_pipe_bag.predict(X_test["tweet"]))}')
print(f'Recall score: {recall_score(y_test, gs_pipe_bag.predict(X_test["tweet"]))}')
print(f'F1 score: {f1_score(y_test, gs_pipe_bag.predict(X_test["tweet"]))}')

Best params: {'bag__n_estimators': 10, 'tfidf__min_df': 2, 'tfidf__ngram_range': (1, 1)}
------------------------------------
Training score: 0.9852548799325938
Testing score: 0.947978096040438
ROC AUC score: 0.5891333092919822
Recall score: 0.1906779661016949
F1 score: 0.26706231454005935


### Random Forest Classifier

In [17]:
pipe_rf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words = 'english')),
    ('rf', RandomForestClassifier(random_state= 19))
])

params_pipe_rf = {
    'tfidf__min_df': [2, 3],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'rf__max_depth': [5, 10]
}

gs_pipe_rf = GridSearchCV(pipe_rf, params_pipe_rf, cv = 5)
gs_pipe_rf.fit(X_train['tweet'], y_train)





GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                         

In [18]:
print(f'Best params: {gs_pipe_rf.best_params_}')
print("------------------------------------")
print(f'Training score: {gs_pipe_rf.score(X_train["tweet"], y_train)}')
print(f'Testing score: {gs_pipe_rf.score(X_test["tweet"], y_test)}')
print(f'ROC AUC score: {roc_auc_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')
print(f'Recall score: {recall_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')
print(f'F1 score: {f1_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')

Best params: {'rf__max_depth': 10, 'tfidf__min_df': 3, 'tfidf__ngram_range': (1, 2)}
------------------------------------
Training score: 0.9509900294902401
Testing score: 0.9502948609941028
ROC AUC score: 0.5020078284649597
Recall score: 0.00423728813559322
F1 score: 0.008403361344537815


**Since our models perform poorly due to unbalanced classes, we want to resample our data to improve model performance:**

### Resampling - Decreasing class 0

In [7]:
dropping_list = np.random.choice(list(df[df['target'] == 0].index), 17000, replace = False)

In [8]:
df_decrease_class_0 = df.drop(dropping_list)

In [9]:
df_decrease_class_0['target'].value_counts()

0    1048
1     942
Name: target, dtype: int64

In [10]:
df_decrease_class_0['target'].value_counts(normalize = True)

0    0.526633
1    0.473367
Name: target, dtype: float64

Now we have a roughly 53% vs. 47% ratio between our two classes.
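
Because `np.random.choice` is called without a seed, the exact rows that get dropped change on every run. A reproducible variant could use `DataFrame.sample` with a fixed `random_state`. The sketch below runs on a small made-up frame; `toy`, the sample size and the seed are placeholders for illustration.

```python
import pandas as pd

# Made-up frame standing in for the tweet data: 16 negatives, 4 positives
toy = pd.DataFrame({
    'tweet': [f'tweet {i}' for i in range(20)],
    'target': [0] * 16 + [1] * 4,
})

# Keep a fixed-size, seeded random subset of the majority class
majority_kept = toy[toy['target'] == 0].sample(n=4, random_state=19)
minority = toy[toy['target'] == 1]

# Recombine into a balanced frame
toy_balanced = pd.concat([majority_kept, minority])

print(toy_balanced['target'].value_counts())
```

The same pattern with `n=1048` on the real frame would keep the class sizes shown above while making the selection repeatable.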

In [11]:
# Assign X and y
X = df_decrease_class_0[['tweet']]
y = df_decrease_class_0['target']

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 19, stratify = y)

### Logistic Regression Model

In [24]:
pipe_lgr = Pipeline([
    ('cvec', CountVectorizer(stop_words = 'english')),
    ('lgr', LogisticRegression(solver = 'liblinear'))  # liblinear supports both 'l1' and 'l2'
])

params_pipe_lgr = {
    'cvec__min_df': [2, 3],
    'cvec__ngram_range': [(1,1), (1,2)],
    'lgr__C': [1.0, 2.0],
    'lgr__penalty': ['l1', 'l2']
}

gs_pipe_lgr = GridSearchCV(pipe_lgr, params_pipe_lgr, cv = 5)
gs_pipe_lgr.fit(X_train['tweet'], y_train)







GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                            

In [25]:
print(f'Best params: {gs_pipe_lgr.best_params_}')
print("------------------------------------")
print(f'Training score: {gs_pipe_lgr.score(X_train["tweet"], y_train)}')
print(f'Testing score: {gs_pipe_lgr.score(X_test["tweet"], y_test)}')
print(f'ROC AUC score: {roc_auc_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')
print(f'Recall score: {recall_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')
print(f'F1 score: {f1_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')

Best params: {'cvec__min_df': 2, 'cvec__ngram_range': (1, 2), 'lgr__C': 1.0, 'lgr__penalty': 'l2'}
------------------------------------
Training score: 0.9242627345844504
Testing score: 0.6586345381526104
ROC AUC score: 0.6522350886272481
Recall score: 0.5296610169491526
F1 score: 0.5952380952380952


### Bagging Classifier

In [12]:
pipe_bag = Pipeline([
    ('cvec', CountVectorizer(stop_words = 'english')),
    ('bag', BaggingClassifier())
])

params_pipe_bag = {
    'cvec__min_df': [2, 3],
    'cvec__ngram_range': [(1,1), (1,2)],
    'bag__n_estimators': [8, 10]
}

gs_pipe_bag = GridSearchCV(pipe_bag, params_pipe_bag, cv = 5)
gs_pipe_bag.fit(X_train['tweet'], y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                            

In [13]:
print(f'Best params: {gs_pipe_bag.best_params_}')
print("------------------------------------")
print(f'Training score: {gs_pipe_bag.score(X_train["tweet"], y_train)}')
print(f'Testing score: {gs_pipe_bag.score(X_test["tweet"], y_test)}')
print(f'ROC AUC score: {roc_auc_score(y_test, gs_pipe_bag.predict(X_test["tweet"]))}')
print(f'Recall score: {recall_score(y_test, gs_pipe_bag.predict(X_test["tweet"]))}')
print(f'F1 score: {f1_score(y_test, gs_pipe_bag.predict(X_test["tweet"]))}')

Best params: {'bag__n_estimators': 10, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2)}
------------------------------------
Training score: 0.9631367292225201
Testing score: 0.6546184738955824
ROC AUC score: 0.6519924957950576
Recall score: 0.6016949152542372
F1 score: 0.6228070175438597


### Random Forest Classifier

In [28]:
pipe_rf = Pipeline([
    ('cvec', CountVectorizer(stop_words = 'english')),
    ('rf', RandomForestClassifier(random_state= 19))
])

params_pipe_rf = {
    'cvec__min_df': [2, 3],
    'cvec__ngram_range': [(1,1), (1,2)],
    'rf__max_depth': [5, 10]
}

gs_pipe_rf = GridSearchCV(pipe_rf, params_pipe_rf, cv = 5)
gs_pipe_rf.fit(X_train['tweet'], y_train)





GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                            

In [29]:
print(f'Best params: {gs_pipe_rf.best_params_}')
print("------------------------------------")
print(f'Training score: {gs_pipe_rf.score(X_train["tweet"], y_train)}')
print(f'Testing score: {gs_pipe_rf.score(X_test["tweet"], y_test)}')
print(f'ROC AUC score: {roc_auc_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')
print(f'Recall score: {recall_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')
print(f'F1 score: {f1_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')

Best params: {'cvec__min_df': 3, 'cvec__ngram_range': (1, 2), 'rf__max_depth': 10}
------------------------------------
Training score: 0.7037533512064343
Testing score: 0.6244979919678715
ROC AUC score: 0.6071775132617414
Recall score: 0.2754237288135593
F1 score: 0.41009463722397477


## TFIDF

Let's try `TfidfVectorizer` instead of `CountVectorizer` on the downsampled data to see if our models perform better:

### Logistic Regression Model

In [30]:
pipe_lgr = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words = 'english')),
    ('lgr', LogisticRegression(solver = 'liblinear'))  # liblinear supports both 'l1' and 'l2'
])

params_pipe_lgr = {
    'tfidf__min_df': [2, 3],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'lgr__C': [1.0, 2.0],
    'lgr__penalty': ['l1', 'l2']
}

gs_pipe_lgr = GridSearchCV(pipe_lgr, params_pipe_lgr, cv = 5)
gs_pipe_lgr.fit(X_train['tweet'], y_train)







GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                         

In [31]:
print(f'Best params: {gs_pipe_lgr.best_params_}')
print("------------------------------------")
print(f'Training score: {gs_pipe_lgr.score(X_train["tweet"], y_train)}')
print(f'Testing score: {gs_pipe_lgr.score(X_test["tweet"], y_test)}')
print(f'ROC AUC score: {roc_auc_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')
print(f'Recall score: {recall_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')
print(f'F1 score: {f1_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')

Best params: {'lgr__C': 1.0, 'lgr__penalty': 'l2', 'tfidf__min_df': 2, 'tfidf__ngram_range': (1, 2)}
------------------------------------
Training score: 0.8672922252010724
Testing score: 0.678714859437751
ROC AUC score: 0.6711088109716651
Recall score: 0.5254237288135594
F1 score: 0.607843137254902


### Bagging Classifier

In [32]:
pipe_bag = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words = 'english')),
    ('bag', BaggingClassifier())
])

params_pipe_bag = {
    'tfidf__min_df': [2, 3],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'bag__n_estimators': [8, 10]
}

gs_pipe_bag = GridSearchCV(pipe_bag, params_pipe_bag, cv = 5)
gs_pipe_bag.fit(X_train['tweet'], y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                         

In [33]:
print(f'Best params: {gs_pipe_bag.best_params_}')
print("------------------------------------")
print(f'Training score: {gs_pipe_bag.score(X_train["tweet"], y_train)}')
print(f'Testing score: {gs_pipe_bag.score(X_test["tweet"], y_test)}')
print(f'ROC AUC score: {roc_auc_score(y_test, gs_pipe_bag.predict(X_test["tweet"]))}')
print(f'Recall score: {recall_score(y_test, gs_pipe_bag.predict(X_test["tweet"]))}')
print(f'F1 score: {f1_score(y_test, gs_pipe_bag.predict(X_test["tweet"]))}')

Best params: {'bag__n_estimators': 8, 'tfidf__min_df': 2, 'tfidf__ngram_range': (1, 2)}
------------------------------------
Training score: 0.9457104557640751
Testing score: 0.6506024096385542
ROC AUC score: 0.6401863112951223
Recall score: 0.4406779661016949
F1 score: 0.5445026178010471


### Random Forest Classifier

In [34]:
pipe_rf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words = 'english')),
    ('rf', RandomForestClassifier(random_state= 19))
])

params_pipe_rf = {
    'tfidf__min_df': [2, 3],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'rf__max_depth': [5, 10]
}

gs_pipe_rf = GridSearchCV(pipe_rf, params_pipe_rf, cv = 5)
gs_pipe_rf.fit(X_train['tweet'], y_train)





GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                         

In [35]:
print(f'Best params: {gs_pipe_rf.best_params_}')
print("------------------------------------")
print(f'Training score: {gs_pipe_rf.score(X_train["tweet"], y_train)}')
print(f'Testing score: {gs_pipe_rf.score(X_test["tweet"], y_test)}')
print(f'ROC AUC score: {roc_auc_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')
print(f'Recall score: {recall_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')
print(f'F1 score: {f1_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')

Best params: {'rf__max_depth': 10, 'tfidf__min_df': 3, 'tfidf__ngram_range': (1, 2)}
------------------------------------
Training score: 0.717828418230563
Testing score: 0.6164658634538153
ROC AUC score: 0.6033283736576529
Recall score: 0.3516949152542373
F1 score: 0.46498599439775906
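
To compare the six models trained on the downsampled data side by side, the recall and F1 values printed above can be collected in one place. The dictionary keys are our own shorthand labels, not names used elsewhere in the notebook.

```python
# (recall, F1) on the downsampled test split, copied from the outputs above
scores = {
    'CountVectorizer + LogisticRegression': (0.5296610169491526, 0.5952380952380952),
    'CountVectorizer + Bagging': (0.6016949152542372, 0.6228070175438597),
    'CountVectorizer + RandomForest': (0.2754237288135593, 0.41009463722397477),
    'Tfidf + LogisticRegression': (0.5254237288135594, 0.607843137254902),
    'Tfidf + Bagging': (0.4406779661016949, 0.5445026178010471),
    'Tfidf + RandomForest': (0.3516949152542373, 0.46498599439775906),
}

best_by_recall = max(scores, key=lambda name: scores[name][0])
best_by_f1 = max(scores, key=lambda name: scores[name][1])

print(f'Highest recall: {best_by_recall}')
print(f'Highest F1: {best_by_f1}')
```

On these numbers the bagging classifier with `CountVectorizer` leads on both recall and F1. Since `BaggingClassifier()` is created without a `random_state`, a re-run may give slightly different values.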


**We also want to see if increasing class 1 by sampling with replacement would work:**

### Resampling - Increasing class 1

In [36]:
df.shape

(18990, 6)

In [37]:
increasing_list = np.random.choice(list(df[df['target']==1].index),17000,replace = True)

In [38]:
# increasing_list holds index labels, so select rows by label with .loc
df_increase_class_1 = pd.concat([df, df.loc[increasing_list]])

In [39]:
df_increase_class_1['target'].value_counts()

0    18048
1    17942
Name: target, dtype: int64

In [40]:
df_increase_class_1['target'].value_counts(normalize = True)

0    0.501473
1    0.498527
Name: target, dtype: float64

In [41]:
# Assign X and y
X = df_increase_class_1[['tweet']]
y = df_increase_class_1['target']

# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 19, stratify = y)
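
Here the data is oversampled first and split afterwards, so copies of the same tweet can end up in both the training and the test set. An alternative ordering, sketched below on a made-up frame, splits first and oversamples only the training rows. This is not what the notebook does; `toy` and all sizes are placeholders.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame standing in for the tweet data: 32 negatives, 8 positives
toy = pd.DataFrame({
    'tweet': [f'tweet {i}' for i in range(40)],
    'target': [0] * 32 + [1] * 8,
})

# Split first, so the test rows are never seen during resampling
train, test = train_test_split(toy, random_state=19, stratify=toy['target'])

# Oversample the minority class inside the training split only
minority = train[train['target'] == 1]
n_extra = int((train['target'] == 0).sum() - len(minority))
extra = minority.sample(n=n_extra, replace=True, random_state=19)
train_balanced = pd.concat([train, extra])

print(train_balanced['target'].value_counts())
print(f'Tweets shared by train and test: {len(set(train_balanced["tweet"]) & set(test["tweet"]))}')
```

With this ordering the test set keeps the original class ratio and contains no tweet that the model was trained on.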

### Logistic Regression Model

In [42]:
pipe_lgr = Pipeline([
    ('cvec', CountVectorizer(stop_words = 'english')),
    ('lgr', LogisticRegression(solver = 'liblinear'))  # liblinear supports both 'l1' and 'l2'
])

params_pipe_lgr = {
    'cvec__min_df': [2, 3],
    'cvec__ngram_range': [(1,1), (1,2)],
    'lgr__C': [1.0, 2.0],
    'lgr__penalty': ['l1', 'l2']
}

gs_pipe_lgr = GridSearchCV(pipe_lgr, params_pipe_lgr, cv = 5)
gs_pipe_lgr.fit(X_train['tweet'], y_train)







GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                            

In [43]:
print(f'Best params: {gs_pipe_lgr.best_params_}')
print("------------------------------------")
print(f'Training score: {gs_pipe_lgr.score(X_train["tweet"], y_train)}')
print(f'Testing score: {gs_pipe_lgr.score(X_test["tweet"], y_test)}')
print(f'ROC AUC score: {roc_auc_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')
print(f'Recall score: {recall_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')
print(f'F1 score: {f1_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')

Best params: {'cvec__min_df': 3, 'cvec__ngram_range': (1, 2), 'lgr__C': 2.0, 'lgr__penalty': 'l2'}
------------------------------------
Training score: 0.9864033787788975
Testing score: 0.9796621471438097
ROC AUC score: 0.9796693633937578
Recall score: 0.9821667409719126
F1 score: 0.9796553640911618


### Random Forest Classifier

In [44]:
pipe_rf = Pipeline([
    ('cvec', CountVectorizer(stop_words = 'english')),
    ('rf', RandomForestClassifier(random_state= 19))
])

params_pipe_rf = {
    'cvec__min_df': [2, 3],
    'cvec__ngram_range': [(1,1), (1,2)],
    'rf__max_depth': [5, 10]
}

gs_pipe_rf = GridSearchCV(pipe_rf, params_pipe_rf, cv = 5)
gs_pipe_rf.fit(X_train['tweet'], y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                            

In [45]:
print(f'Best params: {gs_pipe_rf.best_params_}')
print("------------------------------------")
print(f'Training score: {gs_pipe_rf.score(X_train["tweet"], y_train)}')
print(f'Testing score: {gs_pipe_rf.score(X_test["tweet"], y_test)}')
print(f'ROC AUC score: {roc_auc_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')
print(f'Recall score: {recall_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')
print(f'F1 score: {f1_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')

Best params: {'cvec__min_df': 2, 'cvec__ngram_range': (1, 1), 'rf__max_depth': 10}
------------------------------------
Training score: 0.6734217545939538
Testing score: 0.6697043787508336
ROC AUC score: 0.6690060467870096
Recall score: 0.4273294694605439
F1 score: 0.5633264766382604
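
Precision is not printed, but it can be recovered from the recall and F1 values above, since F1 = 2PR / (P + R). A small sketch of that arithmetic (the helper name is hypothetical):

```python
def precision_from_recall_f1(recall, f1):
    """Solve F1 = 2PR / (P + R) for precision P."""
    return f1 * recall / (2 * recall - f1)


# Values copied from the Random Forest output above.
rf_recall = 0.4273294694605439
rf_f1 = 0.5633264766382604

rf_precision = precision_from_recall_f1(rf_recall, rf_f1)
print(f"Implied precision: {rf_precision:.3f}")
```

The implied precision is roughly 0.83, which suggests this Random Forest is fairly reliable when it predicts `target == 1` but misses more than half of the positive tweets.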


## TF-IDF

### Logistic Regression Model

In [46]:
pipe_lgr = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words = 'english')),
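    # solver set explicitly so the 'l1' grid entries also run on newer scikit-learn,
    # whose default lbfgs solver rejects the 'l1' penalty
    ('lgr', LogisticRegression(solver = 'liblinear'))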
])

params_pipe_lgr = {
    'tfidf__min_df': [2, 3],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'lgr__C': [1.0, 2.0],
    'lgr__penalty': ['l1', 'l2']
}

gs_pipe_lgr = GridSearchCV(pipe_lgr, params_pipe_lgr, cv = 5)
gs_pipe_lgr.fit(X_train['tweet'], y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                         

In [47]:
print(f'Best params: {gs_pipe_lgr.best_params_}')
print("------------------------------------")
print(f'Training score: {gs_pipe_lgr.score(X_train["tweet"], y_train)}')
print(f'Testing score: {gs_pipe_lgr.score(X_test["tweet"], y_test)}')
print(f'ROC AUC score: {roc_auc_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')
print(f'Recall score: {recall_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')
print(f'F1 score: {f1_score(y_test, gs_pipe_lgr.predict(X_test["tweet"]))}')

Best params: {'lgr__C': 2.0, 'lgr__penalty': 'l2', 'tfidf__min_df': 3, 'tfidf__ngram_range': (1, 2)}
------------------------------------
Training score: 0.9819946650859513
Testing score: 0.9742164925539009
ROC AUC score: 0.9742393988547506
Recall score: 0.9821667409719126
F1 score: 0.9743476337903583
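
The same five print statements are repeated for every model, and each one calls `.predict()` on the test set again. One possible tidy-up is a small helper that predicts once and returns all five numbers. This is only a sketch: the `summarize` name and the toy tweets below are made up for illustration and are not part of the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score, roc_auc_score
from sklearn.pipeline import Pipeline


def summarize(model, X_train, y_train, X_test, y_test):
    """Return the notebook's five metrics, predicting on the test set once."""
    preds = model.predict(X_test)
    return {
        "Training score": model.score(X_train, y_train),
        "Testing score": model.score(X_test, y_test),
        "ROC AUC score": roc_auc_score(y_test, preds),
        "Recall score": recall_score(y_test, preds),
        "F1 score": f1_score(y_test, preds),
    }


# Tiny made-up corpus, only to keep the sketch runnable.
train_tweets = ["fire near the river", "lovely sunny day",
                "smoke and fire downtown", "coffee with friends",
                "fire spreading fast", "great movie tonight"]
train_labels = [1, 0, 1, 0, 1, 0]
test_tweets = ["fire on the hill", "sunny day at the beach"]
test_labels = [1, 0]

toy_model = Pipeline([
    ("cvec", CountVectorizer()),
    ("lgr", LogisticRegression(solver="liblinear")),
])
toy_model.fit(train_tweets, train_labels)

report = summarize(toy_model, train_tweets, train_labels, test_tweets, test_labels)
for name, value in report.items():
    print(f"{name}: {value}")
```

With the notebook's objects, the call would presumably look like `summarize(gs_pipe_lgr, X_train["tweet"], y_train, X_test["tweet"], y_test)`.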


### Random Forest Classifier

In [48]:
pipe_rf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words = 'english')),
    ('rf', RandomForestClassifier(random_state= 19))
])

params_pipe_rf = {
    'tfidf__min_df': [2, 3],
    'tfidf__ngram_range': [(1,1), (1,2)],
    'rf__max_depth': [5, 10]
}

gs_pipe_rf = GridSearchCV(pipe_rf, params_pipe_rf, cv = 5)
gs_pipe_rf.fit(X_train['tweet'], y_train)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('tfidf',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                         

In [49]:
print(f'Best params: {gs_pipe_rf.best_params_}')
print("------------------------------------")
print(f'Training score: {gs_pipe_rf.score(X_train["tweet"], y_train)}')
print(f'Testing score: {gs_pipe_rf.score(X_test["tweet"], y_test)}')
print(f'ROC AUC score: {roc_auc_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')
print(f'Recall score: {recall_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')
print(f'F1 score: {f1_score(y_test, gs_pipe_rf.predict(X_test["tweet"]))}')

Best params: {'rf__max_depth': 10, 'tfidf__min_df': 2, 'tfidf__ngram_range': (1, 1)}
------------------------------------
Training score: 0.6977252519264967
Testing score: 0.6933763058457435
ROC AUC score: 0.6931267943926416
Recall score: 0.6067766384306732
F1 score: 0.6636596367182738
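
`GridSearchCV` is called without a `scoring` argument, so the parameter search and the "Training score" / "Testing score" lines use the classifier's default score, which is accuracy. If recall on `target == 1` is what matters, the search could be told to optimise it directly. A minimal sketch on a made-up corpus (the tweets, labels and the reduced grid are assumptions, chosen only to keep the example small):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: six positive and six negative "tweets".
tweets = [
    "fire near the river", "smoke and fire downtown", "fire spreading fast",
    "evacuation ordered after fire", "fire crews on the hill", "smoke over the city",
    "lovely sunny day", "coffee with friends", "great movie tonight",
    "new phone arrived today", "watching the game tonight", "sunny walk with friends",
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("rf", RandomForestClassifier(n_estimators=10, random_state=19)),
])
params = {"rf__max_depth": [5, 10]}

# scoring="recall" makes the search rank parameter sets by recall, not accuracy.
gs = GridSearchCV(pipe, params, cv=3, scoring="recall")
gs.fit(tweets, labels)

print(f"Best params: {gs.best_params_}")
print(f"Best cross-validated recall: {gs.best_score_}")
```

With `scoring` set, `gs.score(...)` reports recall as well, so the "Training score" and "Testing score" labels would need renaming.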


**The logistic regression results look very good, but increasing the minority class creates a lot of duplicate rows (basically made-up data), so these models should not be used.**
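
Part of the concern with duplicates is that a copied tweet can land in both the training and the test set, which would inflate the test scores. This section does not show whether the notebook oversampled before or after splitting. If oversampling were still wanted, one common arrangement is to split first and duplicate rows only inside the training set. A minimal sketch on a made-up frame that borrows the notebook's column names:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame: 4 positives out of 20 rows.
df_toy = pd.DataFrame({
    "tweet": [f"tweet {i}" for i in range(20)],
    "target": [1] * 4 + [0] * 16,
})

# Split first, so the test set only ever contains original, unduplicated rows.
train, test = train_test_split(df_toy, random_state=19, stratify=df_toy["target"])

# Oversample the minority class inside the training set only.
minority = train[train["target"] == 1]
majority = train[train["target"] == 0]
upsampled = minority.sample(n=len(majority), replace=True, random_state=19)
train_balanced = pd.concat([majority, upsampled])

# No original row index appears on both sides of the split.
overlap = set(train_balanced.index) & set(test.index)
print(train_balanced["target"].value_counts())
print(f"Rows shared between train and test: {len(overlap)}")
```

The test set keeps the original class imbalance in this arrangement, so accuracy alone would again be misleading and recall or F1 on `target == 1` would be the numbers to watch.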