# A model to classify comments on the GirlGamers subreddit using NLP

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

In [2]:
from fastai.text import *

## Language model

Note that language models can use a lot of GPU memory, so you may need to decrease the batch size here.

In [None]:
bs=48

In [4]:
df = pd.read_csv('/home/jmn21373/.fastai/data/reddit/GirlGamers/reddit_GirlGamers.csv')

In [None]:
df.head()

In [None]:
df[pd.isna(df['body'])]

In [None]:
df.dropna(inplace=True)

In [None]:
df[pd.isna(df['body'])]

I initially got an error message from fastai when I tried to create a language model TextList because a few rows had NaN in the body. That's why I dropped them. Note that `dropna` with no arguments drops rows with NaN in any column, not just `body`.
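A minimal sketch of dropping only the rows whose text is missing, using the `subset` argument of `DataFrame.dropna`. The toy frame and its values are made up for illustration; only the column names `score` and `body` come from this notebook.

```python
import pandas as pd

# Toy stand-in for the Reddit data; the real frame is read from the CSV above.
toy = pd.DataFrame({
    "score": [5, None, 12, 3],
    "body": ["nice game", "gg", None, "[removed]"],
})

# Drop only the rows whose text is missing; rows with other gaps are kept.
clean = toy.dropna(subset=["body"])

print(len(toy), len(clean))          # 4 rows before, 3 after
print(clean["body"].isna().sum())    # no missing text remains
```

Restricting the drop to `body` keeps comments that have text but a missing value elsewhere, which may or may not be what you want for the classifier later on.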

In [None]:
data_lm = (TextList.from_df(df=df, cols="body")
           # Inputs: the text in the `body` column of the DataFrame
           .split_by_rand_pct(0.1)
           # Randomly hold out 10% of the comments for validation
           .label_for_lm()
           # Language model: the labels are the next tokens of the text itself
           .databunch(bs=bs))
data_lm.save('data_lm.pkl')

Since I didn't specify a path, it puts data_lm.pkl in the default path, which is the directory containing this notebook, i.e., /home/jmn21373/course-v3/nbs/dl1

In [None]:
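# `path` was never defined in this notebook, so pass the save directory explicitly
data_lm = load_data('/home/jmn21373/course-v3/nbs/dl1', 'data_lm.pkl', bs=bs)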

In [None]:
data_lm.show_batch()

We can then put this in a learner object very easily, with a model loaded with the pretrained weights. They'll be downloaded the first time you execute the following line and stored in `~/.fastai/models/` (or elsewhere if you specified different paths in your config file).

In [None]:
learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3)

In [None]:
learn.lr_find()

Got a CUDA out-of-memory error, so reducing batch size and trying again.

Actually that wasn't the problem.  I still had a HUGE amount of memory devoted to my Amazon polarity data experiment, which didn't work so hot.
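When the memory is held by objects in the current kernel, something like the sketch below can release it. `free_gpu_memory` is a hypothetical helper name, and the commented-out `del` line uses placeholder names. Memory held by a different notebook's kernel cannot be freed this way; that kernel has to be shut down.

```python
import gc

try:
    import torch  # PyTorch is what fastai v1 runs on
except ImportError:  # keep the sketch importable on machines without PyTorch
    torch = None


def free_gpu_memory():
    """Release cached GPU memory held by this Python process, if any.

    Returns the bytes still allocated by live tensors, or None without a GPU.
    """
    gc.collect()  # drop unreachable Python objects that still hold tensors
    if torch is None or not torch.cuda.is_available():
        return None
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
    return torch.cuda.memory_allocated()


# Typical use in a notebook: delete the big objects first, then free the cache.
# del learn, data_lm   # placeholder names from an earlier experiment
remaining = free_gpu_memory()
print(remaining)
```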

In [15]:
bs =48

In [16]:
data_lm = load_data('/home/jmn21373/course-v3/nbs/dl1', 'data_lm.pkl', bs=bs)

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot(skip_end=15)

In [None]:
learn.fit_one_cycle(1, 5e-1, moms=(0.8,0.7))

In [None]:
# After fit_one_cycle finishes, validate() returns [valid_loss, accuracy]; check that it matches the last epoch:
learn.validate()

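What we see here is that validation loss is slightly less than training loss. That is plausible at this stage, since dropout is active during training but switched off for validation, and it points to underfitting rather than overfitting, so there is room to train more. Accuracy is not great; I would hope for about 30%.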

In [None]:
learn.save('fit_head')

In [None]:
learn.save_encoder('crude_enc')

In [None]:
learn.load('fit_head');

To complete the fine-tuning, we can then unfreeze and launch a new training.

In [None]:
learn.unfreeze()

In [None]:
learn.fit_one_cycle(10, 1e-3, moms=(0.8,0.7))

In [None]:
# After fit_one_cycle finishes, validate() returns [valid_loss, accuracy]; check that it matches the last epoch:
learn.validate()

25% accuracy is not bad.  Let's see if we can get it higher.
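For context, the accuracy of a language model is, as far as I understand the metric used here, the fraction of positions where the predicted next token equals the actual next token. A minimal sketch with made-up tokens; `next_token_accuracy` is a hypothetical helper, not a fastai function:

```python
def next_token_accuracy(predicted, target):
    """Fraction of positions where the predicted token equals the true next token."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    hits = sum(p == t for p, t in zip(predicted, target))
    return hits / len(target)


# Made-up example: the model sees each prefix and guesses the next token.
target    = ["i", "liked", "this", "game", "because", "it", "was", "fun"]
predicted = ["i", "love",  "the",  "game", "and",     "it", "is",  "a"]

acc = next_token_accuracy(predicted, target)
print(f"{acc:.3f}")
```

Seen this way, 25% means guessing one token in four exactly right out of the whole vocabulary, which is a harder task than the number alone suggests.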

In [None]:
learn.lr_find(start_lr=1e-9, end_lr=1)

In [None]:
learn.recorder.plot()

So this looks like I am in some sort of minimum; whether it's local or global or a saddle point, I can't tell.  See

https://forums.fast.ai/t/interpreting-the-sched-plot-from-lr-find/12329/2

I can try bumping the learning rate way up and see if it jumps me out of a local minimum and into a better one.  Worth a shot.  Better save current model first though.
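To see what a high maximum rate means under one-cycle training, here is a simplified sketch of the schedule: the rate warms up to `lr_max`, then anneals down, while momentum moves in the opposite direction between the two `moms` values. The defaults `div_factor=25` and `pct_start=0.3`, and the cosine shape, are my assumptions about fastai v1's behaviour, not taken from this notebook.

```python
import math


def cos_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (math.cos(math.pi * pct) + 1)


def one_cycle(step, total_steps, lr_max, moms=(0.8, 0.7),
              div_factor=25.0, pct_start=0.3):
    """Return (lr, momentum) for a step; the defaults here are assumptions."""
    warmup = int(total_steps * pct_start)
    if step < warmup:
        # Phase 1: learning rate rises, momentum falls.
        pct = step / warmup
        lr = cos_anneal(lr_max / div_factor, lr_max, pct)
        mom = cos_anneal(moms[0], moms[1], pct)
    else:
        # Phase 2: learning rate decays to a tiny value, momentum recovers.
        pct = (step - warmup) / max(1, total_steps - warmup)
        lr = cos_anneal(lr_max, lr_max / (div_factor * 1e4), pct)
        mom = cos_anneal(moms[1], moms[0], pct)
    return lr, mom


total = 100
for step in (0, 30, 99):
    lr, mom = one_cycle(step, total, lr_max=5e-1)
    print(step, round(lr, 6), round(mom, 3))
```

With `lr_max=5e-1` the schedule starts at 0.02 and peaks at 0.5, so most of the epoch is spent at rates far above the 1e-3 used for fine-tuning.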

In [None]:
learn.save('fine_tuned')

In [None]:
learn.validate()

In [None]:
learn.fit_one_cycle(1, 5e-1, moms=(0.8,0.7))

Nope. Looks like 25% is the best I can do. From inspecting the data, there are a lot of sentence fragments, so the quality of text in Reddit comments isn't on par with the Wikipedia text the pretrained model was trained on.

In [None]:
learn.load('fine_tuned');

In [None]:
learn.validate()

In [None]:
TEXT = "I liked this game because"
N_WORDS = 20
N_SENTENCES = 2

In [None]:
print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))
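The `temperature=0.75` argument controls how adventurous the sampling is. A sketch of the general idea, assuming the common formulation where each probability is raised to the power 1/T and the result is renormalised; whether fastai does exactly this internally is an assumption on my part, and `apply_temperature` is a hypothetical helper.

```python
def apply_temperature(probs, temperature):
    """Rescale a probability distribution: p_i ** (1/T), then renormalise."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


# Made-up next-word probabilities for three candidate tokens.
probs = [0.6, 0.3, 0.1]

sharper = apply_temperature(probs, 0.75)  # T < 1 favours the likeliest token
flatter = apply_temperature(probs, 1.5)   # T > 1 spreads probability out

print([round(p, 3) for p in sharper])
print([round(p, 3) for p in flatter])
```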

We have to save not only the model but also its encoder, the part that's responsible for creating and updating the hidden state. For the next part, we don't care about the part that tries to guess the next word.

In [None]:
learn.save_encoder('fine_tuned_enc')

## Classifier

The classifier needs labels, so I have to create them from the comment score.

In [None]:
df.describe()

In [None]:
mean = df['score'].mean()
mean

In [None]:
std = df['score'].std()
std

In [33]:
pd.set_option('display.max_colwidth', -1)

In [None]:
df[df['score'] > 2180]['body']

In [None]:
df[df['score'] < 300]['body']

In [None]:
dfc = df[(df['score'] > mean+std) | (df['score'] < mean-std)]
dfc.head()

In [None]:
dfc[dfc['score'] > 600]

In [None]:
dfc.columns

In [None]:
mp = mean + std
def label(row):
    if row['score'] > mp:
        return 'pos'
    else:
        return 'neg'

In [None]:
dfc.apply(lambda row: label(row), axis = 1)

In [None]:
dfc[dfc.index == 10797]

In [None]:
dfc['label'] = dfc.apply (lambda row: label(row), axis=1)

In [None]:
dfc.head()

In [None]:
bs = 96

In [17]:
data_clas = (TextList.from_df(df=dfc, cols=["body"], vocab=data_lm.vocab)
           .split_by_rand_pct(0.1)
           # Randomly hold out 10% of the comments for validation
           .label_from_df(cols="label")
           # Labels come from the 'label' column ('pos' or 'neg', derived from score)
           .databunch(bs=bs)
            )

In [None]:
data_clas.save('data_clas.pkl')

In [None]:
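# `path` was never defined in this notebook, so pass the save directory explicitly
data_clas = load_data('/home/jmn21373/course-v3/nbs/dl1', 'data_clas.pkl', bs=bs)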

In [None]:
data_clas.show_batch()

In [18]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')

In [None]:
learn.lr_find()

In [None]:
learn.recorder.plot()

So the first time I ran this it was a little fishy.  It just returned everything as positive.  Why?  Because I naively used 1 sigma as a cutoff without looking at the distribution of the data.  As a result I have a dataset with highly skewed classes.  Let's try this again.
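A small sketch of why a one-sigma cutoff goes wrong on heavy-tailed data. The scores below are made up, but they have the same shape as comment scores: most sit near 1 and a few are very large. The large values inflate the standard deviation so much that nothing falls below `mean - std`.

```python
import statistics

# Made-up, heavy-tailed scores: most comments sit near 1, a few go viral.
scores = [1] * 70 + [2] * 15 + [5] * 8 + [30] * 4 + [300] * 2 + [-3]

mean = statistics.mean(scores)
std = statistics.stdev(scores)

# The two tails that a "more than one sigma from the mean" filter would keep.
high = [s for s in scores if s > mean + std]
low = [s for s in scores if s < mean - std]

print(f"mean={mean:.2f} std={std:.2f}")
print(f"above mean+std: {len(high)}, below mean-std: {len(low)}")
```

Every row that survives the filter here lands on the high side, so a classifier trained on it only ever sees one class.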

In [None]:
learn.fit_one_cycle(1, 5e-1, moms=(0.8,0.7))

In [None]:
dfc.describe()

In [None]:
learn.save('first')

In [None]:
learn.load('first');

In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))

In [None]:
learn.save('second')

In [None]:
learn.load('second');

In [None]:
#learn.freeze_to(-3)
#learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))

In [None]:
learn.save('third')

In [19]:
learn.load('third');

In [20]:
learn.validate()

[0.5525025, tensor(0.7158)]

In [21]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.590015,0.516599,0.726169,03:26
1,0.573332,0.528842,0.723641,03:20


Training loss is still higher than validation loss, so the model is not overfitting yet. I can try to squeeze out more performance by either increasing the learning rate or training for more epochs.

In [45]:
learn.save("fourth")

In [46]:
learn.fit_one_cycle(4, slice(5e-2/(2.6**4),5e-2), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.670821,0.733579,0.575474,03:20
1,0.709279,0.915317,0.417699,03:11
2,0.688935,0.804065,0.417952,03:11


KeyboardInterrupt: 

I tried increasing the LR and the number of epochs at the same time and training quickly diverged: accuracy fell to about 42% before I interrupted it. I will try again, this time reducing the LR and increasing the number of epochs.
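The `slice(lr/(2.6**4), lr)` arguments in these calls set discriminative learning rates: the earliest layers get the smallest rate and the classifier head gets the largest. A sketch of how such a slice could be spread over the layer groups, assuming five groups and geometric spacing (both are my assumptions about fastai v1; `layer_group_lrs` is a hypothetical helper):

```python
def layer_group_lrs(lr_min, lr_max, n_groups=5):
    """Geometrically spaced learning rates, lowest for the earliest layer group."""
    if n_groups == 1:
        return [lr_max]
    step = (lr_max / lr_min) ** (1 / (n_groups - 1))
    return [lr_min * step ** i for i in range(n_groups)]


lr = 5e-4
lrs = layer_group_lrs(lr / 2.6 ** 4, lr)
for i, g in enumerate(lrs):
    print(f"group {i}: {g:.2e}")
```

Under these assumptions each group trains 2.6 times faster than the one before it, so raising the top rate from 1e-3 to 5e-2 raises every group's rate fifty-fold.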

In [47]:
learn.load('fourth');

  except: warn("Wasn't able to properly load the optimizer state again.")


In [48]:
learn.validate()

[0.5288423, tensor(0.7236)]

In [49]:
learn.fit_one_cycle(4, slice(5e-4/(2.6**4),5e-4), moms=(0.8,0.7))

epoch,train_loss,valid_loss,accuracy,time
0,0.575951,7.217884,0.73426,03:04
1,0.571246,4.310321,0.738306,03:06
2,0.557483,0.612212,0.733249,03:22
3,0.548533,2.002439,0.736536,03:27


In [58]:
#dfc[dfc['label'] == 'neg']['body']

In [57]:
#dfc[dfc['label'] == 'pos']['body']

In [50]:
learn.predict("I really loved that game, it was awesome!")

(Category pos, tensor(1), tensor([0.0494, 0.9506]))

In [51]:
learn.predict("[removed]")

(Category neg, tensor(0), tensor([0.9259, 0.0741]))

In [52]:
learn.predict("Video games are my happy place.")

(Category pos, tensor(1), tensor([0.1336, 0.8664]))

In [53]:
learn.predict("Not all men share these feelings.  I think you are attacking me just because I am a man.")

(Category neg, tensor(0), tensor([0.8019, 0.1981]))

In [54]:
learn.predict("I just want to play the game and feel like I have a connection with the character.")

(Category pos, tensor(1), tensor([0.1552, 0.8448]))

In [55]:
learn.predict("Why does this subreddit even exist?  What a waste of time.  Who cares about girl gamers?")

(Category neg, tensor(0), tensor([0.6675, 0.3325]))

In [56]:
learn.predict("I think the artwork and character development are superb.")

(Category pos, tensor(1), tensor([0.3631, 0.6369]))

In [None]:
dfc[dfc['label'] == 'neg'].count()

In [None]:
dfc[dfc['label'] == 'pos'].count()

In [None]:
df.hist(column = 'score', bins = 1000)

In [None]:
df.describe()

In [None]:
df[df['score'] <= 1.0].count()

In [None]:
212315/567734

In [None]:
df.quantile(0.1)

In [None]:
df.quantile(0.5)

In [None]:
df.quantile(0.75)

In [None]:
df.quantile(0.97)

In [None]:
df.quantile(0.03)

In [None]:
df[df.score >= 26.0].count()

In [None]:
df[df.score <= -1.0].count()

Looks like using top/bottom 3% is the way to go.  Just like textfit.
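A minimal sketch of the top/bottom 3% selection on toy data. It calls `quantile` on the `score` column by name rather than on the whole frame, which avoids relying on column position and on how pandas treats the text column. The toy scores are made up.

```python
import pandas as pd

# Toy scores 1..100 standing in for the real `score` column.
toy = pd.DataFrame({"score": range(1, 101)})

bottom = toy["score"].quantile(0.03)   # 3rd percentile
top = toy["score"].quantile(0.97)      # 97th percentile

# Keep only the two tails; everything in between is discarded.
tails = toy[(toy["score"] >= top) | (toy["score"] <= bottom)]

print(bottom, top, len(tails))
```

Unlike the one-sigma cutoff, quantile thresholds take roughly the same share of rows from each end no matter how skewed the scores are, although ties at a threshold value can still make the two classes unequal.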

In [5]:
bottom = df.quantile(0.03)
top = df.quantile(0.97)

In [6]:
print (top[0])
print (bottom[0])

26.0
-1.0


In [7]:
dfc = df[(df['score'] >= top[0]) | (df['score'] <= bottom[0])]
dfc.head()

Unnamed: 0,score,body
10797,256,"Indeed.\n\nIn fact, ""don't be transphobic"" fal..."
10798,256,I couldn't care either way.
10799,256,Basically you can endorse players for being go...
168058,257,I'll probably never play this game because I d...
168059,257,"Psh, you attention whore, all mentioning your ..."


In [8]:
def label(row):
    if row['score'] >= top[0]:
        return 'pos'
    else:
        return 'neg'

In [9]:
dfc.apply(lambda row: label(row), axis = 1)

10797     pos
10798     pos
10799     pos
168058    pos
168059    pos
282730    pos
346508    pos
396192    pos
414390    pos
414391    pos
414392    pos
428071    pos
428072    pos
428073    pos
454677    pos
454678    pos
454679    pos
465849    pos
474110    pos
474111    pos
477342    pos
480125    pos
486981    neg
486982    neg
490485    pos
490486    pos
492017    pos
492018    pos
493352    pos
493353    pos
         ... 
567719    neg
567720    neg
567721    neg
567722    neg
567723    neg
567724    neg
567725    neg
567726    neg
567727    neg
567728    neg
567729    neg
567730    neg
567731    neg
567732    neg
567733    neg
567734    neg
567735    neg
567736    neg
567737    neg
567738    neg
567739    neg
567740    neg
567741    neg
567742    neg
567743    neg
567744    neg
567745    neg
567746    neg
567747    neg
567748    neg
Length: 39551, dtype: object

In [10]:
dfc['label'] = dfc.apply (lambda row: label(row), axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [59]:
#dfc.head()

In [12]:
dfc[dfc.label == 'neg'].count()

score    21723
body     21723
label    21723
dtype: int64

In [13]:
dfc[dfc.label == 'pos'].count()

score    17828
body     17828
label    17828
dtype: int64
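The `dfc['label'] = ...` assignment earlier raised a pandas warning because `dfc` was created by filtering `df`. One way to avoid it is to take an explicit copy of the filtered frame before adding the column. The sketch below uses made-up rows, also labels with `numpy.where` instead of a row-wise `apply`, and uses `value_counts` as a compact check of class balance.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "score": [40, -2, 1, 26, -5, 3],
    "body": ["a", "b", "c", "d", "e", "f"],
})
top, bottom = 26.0, -1.0  # thresholds printed earlier in the notebook

# .copy() makes the filtered frame independent of `toy`, so adding a column
# no longer triggers the copy-of-a-slice warning.
tails = toy[(toy["score"] >= top) | (toy["score"] <= bottom)].copy()

# Vectorised labelling instead of a row-wise apply.
tails["label"] = np.where(tails["score"] >= top, "pos", "neg")

print(tails["label"].value_counts())
```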

In [None]:
data_clas = (TextList.from_df(df=dfc, cols=["body"], vocab=data_lm.vocab)
           .split_by_rand_pct(0.1)
           # Randomly hold out 10% of the comments for validation
           .label_from_df(cols="label")
           # Labels come from the 'label' column ('pos' or 'neg', derived from score)
           .databunch(bs=bs)
            )

In [None]:
data_clas.save('data_clas.pkl')

In [None]:
learn = text_classifier_learner(data_clas, AWD_LSTM, drop_mult=0.5)
learn.load_encoder('fine_tuned_enc')

In [None]:
learn.fit_one_cycle(1, 5e-1, moms=(0.8,0.7))

Not as good as I would have hoped, but this does look better.  Time to refine.

In [None]:
learn.save('first')

In [None]:
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2), moms=(0.8,0.7))

In [None]:
learn.save('second')

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3), moms=(0.8,0.7))

In [None]:
learn.save('third')

In [3]:
learn.load('third')

NameError: name 'learn' is not defined

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3), moms=(0.8,0.7))