# Models
These are first-pass models to see how they perform before I continue scraping over the next few days. I chose Multinomial Naive Bayes from the start and want to pair it with TFIDF, but I vectorize with Count Vectorizer first because raw word counts are easy to interpret.

# Read in art and hhop csvs and vectorize
This data is from the first two scrapes.

In [1]:
import pandas as pd

In [184]:
#read in first art/h-hop csvs
art =pd.read_csv('../data/dfs/art3cols_csv')
hhop =pd.read_csv('../data/dfs/hhop3cols_csv')

In [185]:
print(art.shape)
hhop.shape

(998, 4)


(740, 4)

In [186]:
#merge dfs
df = pd.merge(art, hhop, how="outer")
#check shape and head and tail
df.shape

(1738, 4)

In [187]:
df.head(2)

Unnamed: 0.1,Unnamed: 0,post,title,subreddit
0,0,,Brexcellent!,streetart
1,1,,Graffiti - LIDE - Yellow Letters,streetart


In [188]:
df.tail(2)

Unnamed: 0.1,Unnamed: 0,post,title,subreddit
1736,67734,,[FRESH] Kobie Dee - This Life ft. Bea Moon. — ...,hiphopheads
1737,67811,,[FRESH VIDEO] * NEXT UP * DREEKDADON- ‘DON STY...,hiphopheads


In [189]:
#remove unnamed
df.drop("Unnamed: 0", axis=1, inplace=True)
df.head(2)

Unnamed: 0,post,title,subreddit
0,,Brexcellent!,streetart
1,,Graffiti - LIDE - Yellow Letters,streetart


In [190]:
# keep an independent copy for later so I can map hhop as 1 also
# (plain assignment would only alias df, so mapping df would change HHop too)
HHop =df.copy()
HHop.tail(2)

Unnamed: 0,post,title,subreddit
1736,,[FRESH] Kobie Dee - This Life ft. Bea Moon. — ...,hiphopheads
1737,,[FRESH VIDEO] * NEXT UP * DREEKDADON- ‘DON STY...,hiphopheads


In [9]:
# Reset index. I know I didn't drop any rows, but just to be sure.
df.reset_index(drop=True, inplace=True)
df.tail(2)

Unnamed: 0,post,title,subreddit
1736,,[FRESH] Kobie Dee - This Life ft. Bea Moon. — ...,hiphopheads
1737,,[FRESH VIDEO] * NEXT UP * DREEKDADON- ‘DON STY...,hiphopheads


In [10]:
#map 1 for art and 0 for music
df['subreddit'] =df['subreddit'].map({'hiphopheads': 0, 'streetart': 1})
df.head(2)

Unnamed: 0,post,title,subreddit
0,,Brexcellent!,1
1,,Graffiti - LIDE - Yellow Letters,1


In [11]:
df.tail(2)

Unnamed: 0,post,title,subreddit
1736,,[FRESH] Kobie Dee - This Life ft. Bea Moon. — ...,0
1737,,[FRESH VIDEO] * NEXT UP * DREEKDADON- ‘DON STY...,0


In [12]:
#checking nans.
df['post'].isnull().sum()

1621

In [13]:
#fill null posts with '' (assign back rather than calling inplace on a column selection)
df['post'] =df['post'].fillna('')
df['post'].head(2)

0    
1    
Name: post, dtype: object

# Count Vectorizer for EDA
Since I know the title column is descriptive, I'll start with that column only. Most of the posts from art are images, and many from hip hop are not verbose either because they link to video content. I'll do TFIDF for the title column and then the post column to compare the difference in text.

In [14]:
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from nltk.corpus import stopwords

In [15]:
#baseline, I'll want models to predict better than this.
#raw counts are 998 for class 1 (art) and 740 for class 0 (hip hop); stratify the split anyway.
value_counts =df.subreddit.value_counts(normalize=True)
value_counts

1    0.574223
0    0.425777
Name: subreddit, dtype: float64
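
As a cross-check on the baseline above, here is a minimal sketch that recomputes the majority-class share from the row counts printed when the two csvs were loaded. The dictionary keys are only labels chosen for this sketch.

```python
# Row counts taken from the shapes printed earlier in this notebook.
counts = {"streetart": 998, "hiphopheads": 740}
total = sum(counts.values())

# Share of each class; the largest share is the accuracy of a model
# that always guesses the majority class.
shares = {name: n / total for name, n in counts.items()}
baseline = max(shares.values())

print(shares)
print(round(baseline, 6))
```

Any model worth keeping should beat this share on the test set.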

In [16]:
#split both and use just title or post
X =df[['title', 'post']]
y =df['subreddit']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42,
                                                    stratify=y)

In [135]:
#coming back to remove more stop words that show up in the important predictors.
#stopwords.words('english')
#var =[word for word in stopwords.words('english')]
#type(var)
#var.append(['https'])
#wordlist =stopwords.words('english').extend('https')
#wordlist

### Creating a new stopword list for all the models
I built this list after seeing obvious subreddit-related words show up among the important predictors; with more research I could probably remove more with confidence. I suspect that 'fresh' and 'amp' should be removed too, and I considered dropping the years and numbers, but filtering out everything that is specifically Reddit-related would take more time. Even though these tokens help to classify, they are not directly related to the content. I do see that lemmatizing, or otherwise normalizing tokens, would help: 'feat' and 'ft' for music could easily become the top word if they were combined.
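
One way the 'feat'/'ft' idea could be tried is sketched below. This is a hand-written alias map rather than true lemmatizing, and both `ALIASES` and `normalize_title` are hypothetical names that do not appear elsewhere in this notebook. The token pattern copies the vectorizer's default (two or more word characters).

```python
import re

# Hypothetical alias table: variants that should count as one token.
ALIASES = {"ft": "feat", "featuring": "feat"}

def normalize_title(title):
    """Lowercase a title and replace known variants with one canonical token."""
    tokens = re.findall(r"\b\w\w+\b", title.lower())
    return " ".join(ALIASES.get(tok, tok) for tok in tokens)

print(normalize_title("Kobie Dee - This Life ft. Bea Moon"))
```

Applied to the title column before vectorizing, for example with `X_train['title'].apply(normalize_title)`, the two spellings would be counted together.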

In [136]:
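# coming back to remove more stop words
# ENGLISH_STOP_WORDS is importable from the text module; the stop_words module is deprecated in newer scikit-learn
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
wordlist =list(ENGLISH_STOP_WORDS)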

wordlist += ['https', 'www', 'x200b', 've', 'll', 'jpg', 'gt', '20', '18', 'com', 'st']

In [137]:
cvec =CountVectorizer(stop_words= wordlist, max_features =1000)
#vec fit train
X_train_cvec =pd.DataFrame(cvec.fit_transform(X_train['title']).todense(), columns=cvec.get_feature_names())

In [138]:
#transform test
X_test_cvec =pd.DataFrame(cvec.transform(X_test['title']).todense(), columns=cvec.get_feature_names())
X_train_cvec.head(2)

Unnamed: 0,000,03,10,100,10th,11,11th,12,13,13alloonz,...,yo,york,young,youtube,yung,yungeen,zabou,zealand,zillakami,zombies
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
#I want multinomial naive bayes
#X_train_cvec.describe()

In [139]:
#no priors
nb =MultinomialNB()
nb.fit(X_train_cvec, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [140]:
predictions =nb.predict(X_test_cvec)

In [141]:
#accuracy should be fine for this classification model
nb.score(X_train_cvec, y_train)

0.9616270145817345

In [142]:
nb.score(X_test_cvec, y_test)

0.9218390804597701

In [143]:
len(nb.coef_)

1

In [144]:
#coef search
nb.coef_.shape     #gives me an array of coefs from multiNB

(1, 1000)

In [145]:
X_train_cvec.columns.shape

(1000,)

In [146]:
coefs = pd.DataFrame(nb.coef_, columns = cvec.get_feature_names()).T
#c_mask = coefs[coefs['Coefficient'] != 0]
#l_coefs =c_mask.abs().sort_values('Coefficient', ascending=False)#[:30]
#l_coefs

In [147]:
coefs.columns =['coefficients']

In [206]:
#strength of predictors or NOT
coefs.sort_values('coefficients', ascending =False)[:30]#[-30:]

Unnamed: 0,coefficients
art,-3.209673
street,-3.240684
mural,-4.300681
graffiti,-4.395991
new,-4.906817
city,-5.089138
artist,-5.140432
wall,-5.140432
london,-5.251657
amp,-5.312282


In [149]:
coefs.index

Index(['000', '03', '10', '100', '10th', '11', '11th', '12', '13', '13alloonz',
       ...
       'yo', 'york', 'young', 'youtube', 'yung', 'yungeen', 'zabou', 'zealand',
       'zillakami', 'zombies'],
      dtype='object', length=1000)

## Pretty good score on ['title'] cvec
These are the words that appear most often in the training titles.

In [150]:
#most popular words in the [title] for both subreddits.  It'd be neat to see which are from which.  Later
popword = X_train_cvec.sum(axis=0)
popword.sort_values(ascending = False).head(50)

fresh        166
art          132
street       129
amp           90
video         70
ft            58
feat          44
mural         43
lil           41
new           40
graffiti      39
official      39
album         31
prod          29
music         28
2019          28
city          25
wall          22
artist        20
rap           17
london        16
nyc           16
freestyle     15
like          14
piece         14
ep            14
spain         13
big           13
black         12
uk            12
la            12
love          12
england       11
2018          11
west          11
live          11
rapper        11
oc            11
nav           11
baby          11
san           11
man           11
original      11
world         11
da            11
brooklyn      10
cologne       10
young         10
york          10
barcelona     10
dtype: int64

### run it with the post column alone
The post column alone scores about 0.60 on the test set, only slightly above the 0.574 baseline. I still assume the interaction column (title combined with post) will do better.

In [151]:
cvec_post =CountVectorizer(stop_words='english', max_features =1000)

In [152]:
#Do cvec on the 'posts'
#vec fit train
X_train_post_cvec =pd.DataFrame(cvec_post.fit_transform(X_train['post']).todense(), columns=cvec_post.get_feature_names())

In [153]:
#transform test
X_test_post_cvec =pd.DataFrame(cvec_post.transform(X_test['post']).todense(), columns=cvec_post.get_feature_names())

In [154]:
X_train_post_cvec.shape

(1303, 1000)

In [155]:
nb_post=MultinomialNB()
post_model =nb_post.fit(X_train_post_cvec, y_train)

In [156]:
post_pred =post_model.predict(X_test_post_cvec)

In [157]:
print(nb_post.score(X_train_post_cvec, y_train))
nb_post.score(X_test_post_cvec, y_test)

0.6170376055257099


0.6022988505747127

In [158]:
#most popular words in the post for Both subreddits.
#PROB a combo of what both coef dfs would show if both had a target of one...
#poppost = X_train_post_cvec.sum(axis=0)
#poppost.sort_values(ascending = False).head(50)

# Try pipeline with TFidf and Mnb with ['title']
I ended up adding a gridsearch to give it a shot. It improved the train score only slightly, and the model already seems to be performing pretty well. In the next notebook I will add more rows from a couple of small new scrapes, run the same pipe and gridsearch, and see whether more posts and an interaction term make a better model.

In [159]:
pipe =Pipeline([
    ('tif', TfidfVectorizer(stop_words=wordlist, max_features=1000)),
    ('mnb', MultinomialNB())
])

#gridsearch
params ={
    'tif__max_df': [.94, .96],
    'tif__min_df': [2, 5, 10],
    'tif__ngram_range': [(1, 1), (1,3)]
}
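
To get a feel for how much work this grid asks for, the sketch below enumerates the combinations with `itertools.product`. The fold count of 3 is an assumption for illustration; the real number depends on the `cv` setting of the search.

```python
from itertools import product

params = {
    "tif__max_df": [0.94, 0.96],
    "tif__min_df": [2, 5, 10],
    "tif__ngram_range": [(1, 1), (1, 3)],
}

# Every combination of the listed values is one candidate pipeline.
candidates = [dict(zip(params, combo)) for combo in product(*params.values())]

n_folds = 3  # assumption: number of cross-validation folds used by the search
print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
```

Adding one more value to any list multiplies the number of fits, which is worth keeping in mind before widening the grid.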

In [160]:
gs =GridSearchCV(estimator =pipe, param_grid=params)

In [161]:
# Evaluate how your model will perform on unseen data
cross_val_score(pipe, X_train['title'], y_train, cv=3).mean() 

0.9155763899924079

In [162]:
# Fit your model
gs.fit(X_train['title'], y_train)



GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tif', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...rue,
        vocabulary=None)), ('mnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'tif__max_df': [0.94, 0.96], 'tif__min_df': [2, 5, 10], 'tif__ngram_range': [(1, 1), (1, 3)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [163]:
#with title column and without visualization...
# train and test scores
print(gs.score(X_train['title'], y_train))
gs.score(X_test['title'], y_test)

0.9669992325402916


0.9218390804597701

In [164]:
#want to find the coefs from the gridsearch, get them out.
gs.best_estimator_.named_steps.tif.get_feature_names()

['000',
 '03',
 '10',
 '100',
 '10th',
 '11',
 '11th',
 '12',
 '13',
 '13alloonz',
 '13lack',
 '15',
 '1999',
 '1st',
 '2005',
 '2016',
 '2017',
 '2018',
 '2019',
 '21',
 '22',
 '24',
 '28',
 '2pac',
 '30',
 '3024',
 'abandoned',
 'abstractdissent',
 'abuse',
 'ace',
 'act',
 'action',
 'ad',
 'advice',
 'ago',
 'ain',
 'album',
 'alfama',
 'allblack',
 'alley',
 'alleyway',
 'alto',
 'amazing',
 'amp',
 'anders',
 'anderson',
 'andy',
 'angeles',
 'announces',
 'anti',
 'antiguo',
 'antonio',
 'ap',
 'app',
 'argentina',
 'art',
 'arte',
 'artist',
 'artists',
 'athens',
 'atlanta',
 'audio',
 'austin',
 'australia',
 'australian',
 'aviv',
 'away',
 'awesome',
 'az',
 'baby',
 'bad',
 'bada',
 'badazz',
 'bairro',
 'bakersfield',
 'balin',
 'balloons',
 'baltimore',
 'bandana',
 'bang',
 'bank',
 'barcelona',
 'basel',
 'basquiat',
 'battle',
 'bc',
 'beach',
 'beans',
 'beats',
 'beautiful',
 'belgian',
 'belgium',
 'belize',
 'benjamin',
 'berlin',
 'best',
 'big',
 'biggie',
 'bil

In [165]:
#put these in a df
gs.best_estimator_.named_steps.mnb.coef_

array([[-7.18712408, -7.63598843, -6.73382673, -7.13760164, -7.20621172,
        -7.24276277, -7.20621172, -7.63598843, -6.95980979, -7.63598843,
        -7.63598843, -7.20327107, -7.63598843, -7.63598843, -7.63598843,
        -7.0403831 , -7.23942152, -5.69586769, -5.58183756, -7.63598843,
        -7.46713405, -7.26097959, -7.63598843, -7.10819298, -7.63598843,
        -7.04201286, -6.91023733, -6.89661884, -7.63598843, -7.63598843,
        -6.99200732, -6.47108867, -6.99620082, -6.92508337, -6.59831661,
        -7.63598843, -7.63598843, -6.89195985, -7.63598843, -6.00535316,
        -6.22606231, -6.93211295, -5.90841888, -5.77936451, -7.63598843,
        -7.63598843, -6.9272747 , -6.09091285, -7.63598843, -6.54594746,
        -7.12679666, -6.63498793, -7.63598843, -6.6151707 , -6.7223802 ,
        -3.73834643, -6.98088495, -5.53041927, -6.17092575, -6.63858389,
        -6.87224674, -7.63598843, -6.39602152, -6.16278057, -7.23573479,
        -6.20649294, -7.23801502, -6.27189753, -6.8

In [207]:
#gridsearch mnb/tif df of coefs.
mnb_tf_coefs = pd.DataFrame(gs.best_estimator_.named_steps.mnb.coef_,
                            columns = gs.best_estimator_.named_steps.tif.get_feature_names()).T

In [208]:
mnb_tf_coefs.columns =['coefficients']

## Coefs ['title'] art as the target (1)

In [168]:
#strength of predictors or NOT
mnb_tf_coefs.sort_values('coefficients', ascending =False)[:30]#[-30:]

Unnamed: 0,coefficients
art,-3.738346
street,-3.791069
mural,-4.683412
graffiti,-4.771298
new,-5.243831
barcelona,-5.367849
london,-5.382491
nyc,-5.452749
wall,-5.464123
uk,-5.476132


# Evaluate
Find the best estimator's coefficients. Since Tfidf puts more weight on the rarer, more distinctive words, I wanted to let it do that work for me and then just filter off the less interesting indicators with the parameters.
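
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on three invented titles. It uses the smoothed IDF formula that scikit-learn documents for `smooth_idf=True`, followed by L2 normalization; the toy titles and helper names are assumptions for illustration only.

```python
import math
from collections import Counter

docs = [
    "street art mural",
    "street art wall",
    "fresh video ft lil",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many titles each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency: rarer terms get larger values.
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    counts = Counter(tokens)
    raw = {t: c * idf(t) for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 normalization
    return {t: v / norm for t, v in raw.items()}

weights = tfidf(tokenized[0])
print(weights)
```

In the first title, 'mural' appears in only one document, so it ends up with a larger weight than 'street' or 'art', which appear in two.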

In [169]:
gs.best_estimator_#.coefs_ 

Pipeline(memory=None,
     steps=[('tif', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.94, max_features=1000, min_df=2,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...rue,
        vocabulary=None)), ('mnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [170]:
from sklearn.metrics import confusion_matrix
#with my preds from ['title'], multNB and cvec
cm =confusion_matrix(y_test, predictions)

In [171]:
#the total number is not very many
cm_df =pd.DataFrame(cm, columns =['pred neg', 'pred pos'],
            index =['actual neg', 'actual pos'])
cm_df

Unnamed: 0,pred neg,pred pos
actual neg,160,25
actual pos,9,241


## Evaluate
### Type I error: 
False positives: the positive class here is street art. There are 25 type I errors, meaning the model misclassifies 0 (hip hop) as 1 (street art) 25 times. This is bad if I want to see art but get hip hop: 25 hip hop posts would show up under street art.

### Type II error:
False negatives: the positive class is still street art. There are only 9 type II errors, meaning the model misclassifies 1 (street art) as 0 (hip hop) 9 times, so 9 street art posts would show up under hip hop.
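
The same four counts from the confusion matrix above also give the usual summary rates. This is a small sketch with the counts typed in by hand, with street art as the positive class.

```python
# Counts from the confusion matrix above (positive class = street art).
tn, fp = 160, 25
fn, tp = 9, 241

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)    # of posts predicted street art, share that really are
recall = tp / (tp + fn)       # of real street art posts, share the model found
specificity = tn / (tn + fp)  # of real hip hop posts, share the model found

for name, value in [("accuracy", accuracy), ("precision", precision),
                    ("recall", recall), ("specificity", specificity)]:
    print(f"{name}: {value:.3f}")
```

The accuracy recovered here matches the test score of the CountVectorizer title model, which is the model these predictions came from.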

I considered scoring the same gs model on the posts, assuming the title model is better, and decided instead to run a gridsearch over my first pipeline of Tfidf and MNB on the post column. The scores from that are:

- c_val: 0.6124318025319138
- train: 0.6162701458173446
- test: 0.6022988505747127

The post column alone is less predictive, so in the next notebook I will add the title to the post column to maximize the word content.

In [174]:
#try to do the same with the posts now.
#step names must match the 'tif__' prefixes used in the param grid
pipe_posts =Pipeline([
    ('tif', TfidfVectorizer(stop_words=wordlist, max_features=1000)),
    ('mnb', MultinomialNB())
])

In [175]:
#gridsearch
params ={
    'tif__max_df': [.94, .96],
    'tif__min_df': [2, 5, 10],
    'tif__ngram_range': [(1, 1), (1,3)]
}

In [176]:
#search over the posts pipeline
gs_posts =GridSearchCV(estimator =pipe_posts, param_grid=params)

In [177]:
cross_val_score(gs_posts, X_train['post'], y_train, cv=3).mean()



0.6093613715415717

In [178]:
#fit post model
gs_posts.fit(X_train['post'], y_train)



GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('tif', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...rue,
        vocabulary=None)), ('mnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'tif__max_df': [0.94, 0.96], 'tif__min_df': [2, 5, 10], 'tif__ngram_range': [(1, 1), (1, 3)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [179]:
#post column train and test scores
print(gs_posts.score(X_train['post'], y_train))
gs_posts.score(X_test['post'], y_test)

0.6193399846508059


0.6022988505747127

Not really an improvement. The post text is too thin to learn from (1621 of the 1738 posts were empty), so titles are much better.

# What about HHop as the target?

Since I can't really read the lowest coefficients as indicators for the (0) target, which was hip hop, I'll swap the values and model hip hop as (1) so I can see from the tfidf which words are its predictors.
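
As an aside, a single fit can in principle rank words for both classes by comparing the per-class log probabilities. The sketch below hand-rolls a tiny multinomial Naive Bayes with Laplace smoothing on invented titles to show the idea; it is not the model fitted in this notebook. With a fitted scikit-learn `MultinomialNB`, the analogous quantity would be the difference between the two rows of `feature_log_prob_`.

```python
import math
from collections import Counter

# Toy titles, invented for illustration; 1 = street art, 0 = hip hop.
titles = [
    ("street art mural", 1),
    ("graffiti wall art", 1),
    ("fresh video ft lil", 0),
    ("fresh album", 0),
]

counts = {0: Counter(), 1: Counter()}
for text, label in titles:
    counts[label].update(text.split())

vocab = sorted(set(counts[0]) | set(counts[1]))
alpha = 1.0  # Laplace smoothing, the MultinomialNB default

def log_prob(word, label):
    """Smoothed log P(word | class), the quantity a multinomial NB learns."""
    total = sum(counts[label].values())
    return math.log((counts[label][word] + alpha) / (total + alpha * len(vocab)))

# Positive values lean street art, negative values lean hip hop.
log_odds = {w: log_prob(w, 1) - log_prob(w, 0) for w in vocab}
ranked = sorted(vocab, key=log_odds.get)
print("most hip hop:", ranked[:2])
print("most street art:", ranked[-2:])
```

Swapping the labels and refitting, as done in the cells that follow, reaches the same kind of ranking by a different route.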

In [191]:
HHop.tail(2)

Unnamed: 0,post,title,subreddit
1736,,[FRESH] Kobie Dee - This Life ft. Bea Moon. — ...,hiphopheads
1737,,[FRESH VIDEO] * NEXT UP * DREEKDADON- ‘DON STY...,hiphopheads


In [192]:
#map 1 for music (hip hop) and 0 for art this time
HHop['subreddit'] =HHop['subreddit'].map({'hiphopheads': 1, 'streetart': 0})
HHop.head(2)

Unnamed: 0,post,title,subreddit
0,,Brexcellent!,0
1,,Graffiti - LIDE - Yellow Letters,0


In [193]:
HHop.tail(2)

Unnamed: 0,post,title,subreddit
1736,,[FRESH] Kobie Dee - This Life ft. Bea Moon. — ...,1
1737,,[FRESH VIDEO] * NEXT UP * DREEKDADON- ‘DON STY...,1


In [194]:
#new split to get the new 1 target in there
X =HHop[['title']]
y =HHop['subreddit']

XH_train, XH_test, yh_train, yh_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=42,
                                                    stratify=y)

## params
I set min_df and max_df because they are an easy, automated way to sift out non-predictive words. It would be interesting to look at the words that appear that often, just as it would be interesting to explore the stopwords further, but in the interest of time I am trusting that these settings make a better model and simply using them.
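
For reference, here is a small sketch of what the two settings do, written in plain Python on invented titles. An integer `min_df` is read as an absolute document count and a float `max_df` as a proportion of documents, which is how the vectorizer interprets them; the threshold of 0.70 is chosen only so that the toy data shows an effect.

```python
from collections import Counter

titles = [
    "fresh video lil baby",
    "fresh album",
    "fresh street art",
    "street art mural",
]
tokenized = [set(t.split()) for t in titles]
n_docs = len(tokenized)

# Document frequency: number of titles containing each term.
doc_freq = Counter(term for tokens in tokenized for term in tokens)

min_df = 2     # absolute count: drop terms seen in fewer than 2 titles
max_df = 0.70  # proportion: drop terms seen in more than 70% of titles

kept = sorted(
    term for term, df in doc_freq.items()
    if df >= min_df and df / n_docs <= max_df
)
print(kept)
```

Here 'fresh' is dropped for being too common and the one-off words for being too rare, leaving only 'art' and 'street'.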

In [195]:
hh_pipe =Pipeline([
    ('hh_tif', TfidfVectorizer(stop_words=wordlist, max_features=1000)),
    ('hh_mnb', MultinomialNB())
])

#gridsearch
hh_params ={
    'hh_tif__max_df': [.94, .96],
    'hh_tif__min_df': [2, 5, 10],
    'hh_tif__ngram_range': [(1, 1), (1,3)]
}

In [196]:
#Don't care about the nans in post right now, just looking at ['title'] to compare to the art title model
#same gs params
HHgs =GridSearchCV(estimator =hh_pipe, param_grid=hh_params)

In [197]:
# Evaluate how hhop will perform on unseen data
cross_val_score(hh_pipe, XH_train['title'], yh_train, cv=3).mean() 

0.904834295601815

In [198]:
# Fit hhop model
HHgs.fit(XH_train['title'], yh_train)



GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('hh_tif', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=1000, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True...,
        vocabulary=None)), ('hh_mnb', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'hh_tif__max_df': [0.94, 0.96], 'hh_tif__min_df': [2, 5, 10], 'hh_tif__ngram_range': [(1, 1), (1, 3)]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [199]:
#with title column and without visualization...
# train and test scores
print(HHgs.score(XH_train['title'], yh_train))
HHgs.score(XH_test['title'], yh_test)

0.955487336914812


0.9080459770114943

In [200]:
HHgs.best_estimator_.named_steps.hh_tif

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.94, max_features=1000, min_df=2,
        ngram_range=(1, 3), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=['keep', 'meanwhile', 'you', 'this', 'latterly', 'someone', 'eight', 'must', 'same', 'alone', 'from', 'often', 'ie', 'about', 'once', 'nobody', 'may', 'sincere', 'less', 'which', 'whenever', 'five', 'twenty', 'former', 'already', 'sometimes', 'thence', 'during', 'as', 'wherein', 'nor', 'f...', 'both', 'ltd', 'fire', 'https', 'www', 'x200b', 've', 'll', 'jpg', 'gt', '20', '18', 'com', 'st'],
        strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

# Compare scores between art as 1 and hhop as 1
The gridsearch chose the same best parameters for both targets except ngram_range:

- max_df = 0.94
- min_df = 2
- ngram_range for art = (1, 1)
- ngram_range for hhop = (1, 3)

### cval, train, test art as 1
- cval: 0.9155746243621662
- train: 0.9669992325402916
- test: 0.9241379310344827

### cval, train, test hhop as 1
- cval: 0.9071366774370112
- train: 0.9570222563315426
- test: 0.9080459770114943

# Compare hhop TF-IDF coefs to art TF-IDF coefs from gs mnb

In [201]:
#want to find the coefs from the gridsearch with hhop as the target.
#HHgs.best_estimator_.named_steps.hh_tif.get_feature_names()
HHgs.best_estimator_.named_steps.hh_mnb.coef_

array([[-7.12312421, -6.53729523, -6.86247102, -6.99837089, -6.37434845,
        -6.90607558, -6.68164814, -6.96127354, -7.61810198, -6.7200183 ,
        -7.61810198, -5.85201789, -6.68017611, -7.09377156, -6.80680805,
        -6.89460893, -6.89460893, -6.69938976, -7.61810198, -6.68668434,
        -4.96679049, -7.61810198, -7.61810198, -7.18458397, -4.41399464,
        -6.96118069, -6.89310877, -7.61810198, -7.61810198, -6.50021979,
        -6.79300364, -7.61810198, -7.61810198, -7.32441564, -7.32258349,
        -7.61810198, -7.61810198, -7.20461624, -7.61810198, -6.9249548 ,
        -7.61810198, -7.15535365, -6.14448862, -7.61810198, -7.61810198,
        -7.61810198, -7.61810198, -6.54325962, -7.61810198, -7.61810198,
        -5.84170113, -6.02236848, -6.76186043, -6.7200888 , -7.61810198,
        -7.61810198, -7.61810198, -7.32258349, -7.61810198, -7.61810198,
        -7.61810198, -7.61810198, -6.62015136, -7.61810198, -6.13118837,
        -6.84226043, -6.84226043, -7.61810198, -7.6

In [202]:
#gridsearch mnb/tif df of coefs.
hh1_mnb_tf_coefs = pd.DataFrame(HHgs.best_estimator_.named_steps.hh_mnb.coef_,
                            columns = HHgs.best_estimator_.named_steps.hh_tif.get_feature_names()).T

In [203]:
hh1_mnb_tf_coefs.columns = ['H-Hop coefs']

In [209]:
#strength of predictors or NOT
hh1_mnb_tf_coefs.sort_values('H-Hop coefs', ascending =False)[:30]#[-30:]

Unnamed: 0,H-Hop coefs
fresh,-3.416723
amp,-4.413995
ft,-4.443461
video,-4.579234
feat,-4.740743
lil,-4.742205
album,-4.96679
fresh video,-5.033287
official,-5.098702
prod,-5.144727
