# Tweets about eating disorders
## 05. Training Random Forest, RNN and Bi-LSTM

This notebook trains and evaluates Random Forest, RNN and Bi-LSTM models on the four binary categorizations available in the dataset:

- Category 1: tweets written by people suffering from an ED are labelled 1 and all others 0. To assess this, each user's profile was reviewed, looking at the profile description and at other tweets published by the user. This made it possible to determine which tweets were written by people who publicly mentioned having an ED and which were not.
- Category 2: tweets that promote having an ED are labelled 1 and all others 0. Some communities of people who suffer from EDs encourage others to develop one by presenting it as something positive or fashionable. Many studies discuss these pro-ED communities, which can be detected through terms such as "proana" or "pro-anorexia" [9].
- Category 3: informative tweets are labelled 1 and non-informative tweets 0. Informative tweets present information with the aim of informing readers; the rest are texts in which the author expresses a subjective opinion.
- Category 4: scientific tweets are labelled 1 and all others 0. An informative tweet written by someone working in research (for example, a person holding a PhD) was labelled as scientific, as were tweets that shared links to articles published in scientific journals.

The Random Forest hyperparameters are selected separately for each category with GridSearchCV, using 5-fold cross-validation.
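The search can be sketched as follows. This is a minimal illustration, not the configuration used in this notebook: the data comes from `make_classification` as a stand-in for the TF-IDF matrix, the grid is much smaller so it runs quickly, and the search is fit on the training split only so that the held-out split stays unseen during tuning (the names `X_demo`, `y_demo` and `search` are assumptions).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the TF-IDF features and one binary label column (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Deliberately small grid so the sketch finishes in a few seconds.
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [4, 8],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,  # 5-fold cross-validation, as described in the text
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print("held-out accuracy:", search.score(X_te, y_te))
```

With the default scoring, `search.score` reports accuracy of the refitted best estimator on the held-out split.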

The RNN and Bi-LSTM architectures used here were chosen after testing several other configurations, all of which performed worse on this problem.
 

In [1]:
import os
import pandas as pd
import numpy as np
import json
from collections import Counter
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import re, string, unicodedata
import nltk
from nltk import word_tokenize, sent_tokenize, FreqDist
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer
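# NLTK resources are downloaded in the cells that need them.
# `error_bad_lines` was deprecated in pandas 1.3; `on_bad_lines='skip'` is its replacement.
tweets = pd.read_csv('tweets_cleaned.csv', encoding='utf8', on_bad_lines='skip')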





In [2]:
tweets.head(2)

Unnamed: 0.1,Unnamed: 0,stream_group,text_orig,Commercial,POLITICS,ED,Family,ED_patience,ProED,Offensive,Informative,Scientific,Sad,hashtag,text,Segmented#,text_similar
0,0,1,RT @beatED: Learn more about anorexia and buli...,0,0,1,0.0,0,0,0,1,0,0.0,['BBCPanorama'],"['learn', 'anorexia', 'bulimia', 'well', 'eati...",bbc panorama,[]
1,1,1,A woman tries to balance her relationships wit...,0,0,1,0.0,0,0,0,0,0,0.0,"['anorexia', 'BodyofWater']","['woman', 'try', 'balance', 'relationship', 'm...",anorexia bodyof water,[]


In [5]:
tweets.columns=['num1','stream_group','text_orig','f1_commercial','f2_politics','f3_ed','f4_family','f5_edpatient','f6_proed','f7_offensive','f8_info','f9_scientific','f10_sad','hashtag','text','segmented','text_similar']

In [6]:
cols=['f1_commercial','f2_politics','f3_ed','f4_family','f5_edpatient','f6_proed','f7_offensive','f8_info','f9_scientific','f10_sad']

In [7]:
tweets1 = tweets.copy().drop(['f1_commercial', 'f2_politics', 'f3_ed', 'f4_family', 'f7_offensive', 'f10_sad'], axis=1)

In [8]:
import pandas as pd
import numpy as np

import spacy
import nltk
import nltk.data
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
import regex as re
import string
from collections import defaultdict

import matplotlib.pyplot as plt
import seaborn as sns
pd.set_option('display.max_colwidth', None)

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score

from simpletransformers.classification import ClassificationModel


import io



In [9]:
punctuations = r"¡!#$%&'()*+,-./:;<=>¿?@[\]^_`{|}~"

def read_txt(filename):
    # Read a text file and return its lines without trailing newlines.
    lines = []
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            lines.append(line.replace('\n', ''))
    return lines

stopwords = read_txt('english_stopwords.txt')

stemmer = SnowballStemmer('english')


def clean_accents(tweet):
    tweet = re.sub(r"[àáâãäå]", "a", tweet)
    tweet = re.sub(r"ç", "c", tweet)
    tweet = re.sub(r"[èéêë]", "e", tweet)
    tweet = re.sub(r"[ìíîï]", "i", tweet)
    tweet = re.sub(r"[òóôõö]", "o", tweet)
    tweet = re.sub(r"[ùúûü]", "u", tweet)
    tweet = re.sub(r"[ýÿ]", "y", tweet)

    return tweet

def clean_tweet(tweet, stem = False):
    tweet = tweet.lower().strip()
    tweet = re.sub(r'https?://\S+', '', tweet)  # strip http:// and https:// URLs
    tweet = re.sub(r'www\.\S+', '', tweet)      # strip bare www. links
    tweet = re.sub(r'\s([@#][\w_-]+)', "", tweet)
    tweet = re.sub(r"\n", " ", tweet)
    tweet = clean_accents(tweet)
    tweet = re.sub(r"\b(a*ha+h[ha]*|o?l+o+l+[ol]*|x+d+[x*d*]*|a*ja+[j+a+]+)\b", "<risas>", tweet)
    for symbol in punctuations:
        tweet = tweet.replace(symbol, "")
    tokens = []
    for token in tweet.strip().split():
        if token not in punctuations and token not in stopwords:
            if stem:
                tokens.append(stemmer.stem(token))
            else:
                tokens.append(token)
    return " ".join(tokens)

In [10]:
tweets1['text_cleaned'] = tweets['text_orig'].apply(lambda s : clean_tweet(s))
#print(tweets1['text_cleaned'].head(5))

In [11]:
#tweets1.head(2)

In [12]:
df = tweets1.copy()
X = df['text_cleaned']
Y1 = df['f5_edpatient']
Y2 = df['f6_proed']
Y3 = df['f8_info']
Y4 = df['f9_scientific']

In [13]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X, Y1, test_size=0.3, random_state=42)
X2_train, X2_test, y2_train, y2_test = train_test_split(X, Y2, test_size=0.3, random_state=42)
X3_train, X3_test, y3_train, y3_test = train_test_split(X, Y3, test_size=0.3, random_state=42)
X4_train, X4_test, y4_train, y4_test = train_test_split(X, Y4, test_size=0.3, random_state=42)

In [14]:
y1_test.value_counts(normalize=True)

1    0.51773
0    0.48227
Name: f5_edpatient, dtype: float64

In [15]:
import numpy as np
import re
import nltk
from sklearn.datasets import load_files
nltk.download('stopwords')
import pickle
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\MICROSOFT\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [16]:
documents = []

nltk.download('omw-1.4')

from nltk.stem import WordNetLemmatizer

stemmer = WordNetLemmatizer()

for sen in range(0, len(X)):
    # Remove all the special characters
    document = re.sub(r'\W', ' ', str(X[sen]))
    
    # remove all single characters
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)
    
    # Remove single characters from the start
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)
    
    # Substituting multiple spaces with single space
    document = re.sub(r'\s+', ' ', document, flags=re.I)
    
    # Removing prefixed 'b'
    document = re.sub(r'^b\s+', '', document)
    
    # Converting to Lowercase
    document = document.lower()
    
    # Lemmatization
    document = document.split()

    document = [stemmer.lemmatize(word) for word in document]
    document = ' '.join(document)
    
    documents.append(document)
    


[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\MICROSOFT\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [17]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(documents).toarray()

In [18]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()
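The two cells above first build a bag-of-words count matrix and then reweight it with TF-IDF. A minimal sketch of that pipeline on toy documents (an assumption; the notebook uses the lemmatized tweets and extra vectorizer settings) shows that, with default settings, it gives the same matrix as scikit-learn's one-step `TfidfVectorizer`:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy documents standing in for the cleaned tweets (assumption).
docs = [
    "eating disorder recovery support",
    "recovery is possible with support",
    "new research on eating disorder treatment",
    "support research and recovery",
]

# Two-step route: raw term counts, then TF-IDF reweighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts).toarray()

# One-step route with default settings.
one_step = TfidfVectorizer().fit_transform(docs).toarray()

print(np.allclose(two_step, one_step))
print(np.linalg.norm(one_step, axis=1))  # each row is L2-normalized by default
```

Note that fitting the vectorizer on all documents before splitting, as done above, lets vocabulary and IDF statistics from the test tweets influence the features; fitting on the training split only would avoid that.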

In [19]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X, Y1, test_size=0.3, random_state=42)
X2_train, X2_test, y2_train, y2_test = train_test_split(X, Y2, test_size=0.3, random_state=42)
X3_train, X3_test, y3_train, y3_test = train_test_split(X, Y3, test_size=0.3, random_state=42)
X4_train, X4_test, y4_train, y4_test = train_test_split(X, Y4, test_size=0.3, random_state=42)

In [28]:
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_jobs=-1,max_features= 'sqrt' ,n_estimators=50, oob_score = True) 

param_grid = { 
    'n_estimators': [200, 700, 800, 1000, 1200],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth' : [4,5,6,7,8],
    'criterion' :['gini', 'entropy']
}

# Run the same 5-fold grid search once per category and print the best parameters.
for y in (Y1, Y2, Y3, Y4):
    CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv=5)
    CV_rfc.fit(X, y)
    print(CV_rfc.best_params_)

{'criterion': 'gini', 'max_depth': 7, 'max_features': 'log2', 'n_estimators': 200}
{'criterion': 'gini', 'max_depth': 8, 'max_features': 'auto', 'n_estimators': 1000}
{'criterion': 'gini', 'max_depth': 8, 'max_features': 'sqrt', 'n_estimators': 800}
{'criterion': 'gini', 'max_depth': 8, 'max_features': 'auto', 'n_estimators': 1000}


In [22]:
from sklearn.model_selection import cross_validate
import time
start_time = time.time()
from sklearn.ensemble import RandomForestClassifier
X1_train, X1_test, y1_train, y1_test = train_test_split(X, Y1, test_size=0.3, random_state=42)


classifier = RandomForestClassifier(criterion='gini',max_depth=7,max_features='log2',n_estimators=200, random_state=42)
classifier.fit(X1_train, y1_train) 

y1_pred = classifier.predict(X1_test)

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y1_test,y1_pred))
print(classification_report(y1_test,y1_pred))
print(accuracy_score(y1_test, y1_pred))
rfc_cv_score = cross_validate(classifier, X, Y1, cv=5, scoring=['accuracy','f1'])
print('accuracy: ',rfc_cv_score['test_accuracy'].mean())
print('f1: ',rfc_cv_score['test_f1'].mean())
print("--- %s seconds ---" % (time.time() - start_time))




[[230  42]
 [ 57 235]]
              precision    recall  f1-score   support

           0       0.80      0.85      0.82       272
           1       0.85      0.80      0.83       292

    accuracy                           0.82       564
   macro avg       0.82      0.83      0.82       564
weighted avg       0.83      0.82      0.82       564

0.824468085106383
accuracy:  0.7916482269503545
f1:  0.7976027744860016
--- 1.8179750442504883 seconds ---


In [106]:
# {'criterion': 'gini', 'max_depth': 8, 'max_features': 'auto', 'n_estimators': 1000}
X1_train, X1_test, y1_train, y1_test = train_test_split(X, Y2, test_size=0.3, random_state=42)
start_time = time.time()
classifier = RandomForestClassifier(criterion='gini',max_depth=8,max_features='auto',n_estimators=1000, random_state=42)
classifier.fit(X1_train, y1_train) 

y1_pred = classifier.predict(X1_test)

print(confusion_matrix(y1_test,y1_pred))
print(classification_report(y1_test,y1_pred))
print(accuracy_score(y1_test, y1_pred))
rfc_cv_score = cross_validate(classifier, X, Y2, cv=5, scoring=['accuracy','f1'])
print('accuracy: ',rfc_cv_score['test_accuracy'].mean())
print('f1: ',rfc_cv_score['test_f1'].mean())
print("--- %s seconds ---" % (time.time() - start_time))


[[427   0]
 [135   2]]
              precision    recall  f1-score   support

           0       0.76      1.00      0.86       427
           1       1.00      0.01      0.03       137

    accuracy                           0.76       564
   macro avg       0.88      0.51      0.45       564
weighted avg       0.82      0.76      0.66       564

0.7606382978723404
accuracy:  0.7671801418439717
f1:  0.0474154747867857
--- 13.491974830627441 seconds ---
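The gap between accuracy and F1 in this output follows directly from the confusion matrix. The short worked example below recomputes the class-1 metrics from the four counts printed above (scikit-learn lays the matrix out with true classes as rows and predictions as columns):

```python
# Counts taken from the confusion matrix printed above.
tn, fp = 427, 0
fn, tp = 135, 2

precision = tp / (tp + fp)   # 2 / 2
recall = tp / (tp + fn)      # 2 / 137
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The classifier labels almost every tweet as class 0, so accuracy stays close to the share of the majority class while recall for class 1 is near zero.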


In [107]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X, Y3, test_size=0.3, random_state=42)
start_time = time.time()
classifier = RandomForestClassifier(criterion='gini',max_depth=8,max_features='sqrt',n_estimators=800, random_state=42)
classifier.fit(X1_train, y1_train) 

y1_pred = classifier.predict(X1_test)

print(confusion_matrix(y1_test,y1_pred))
print(classification_report(y1_test,y1_pred))
print(accuracy_score(y1_test, y1_pred))
rfc_cv_score = cross_validate(classifier, X, Y3, cv=5, scoring=['accuracy','f1'])
print('accuracy: ',rfc_cv_score['test_accuracy'].mean())
print('f1: ',rfc_cv_score['test_f1'].mean())
print("--- %s seconds ---" % (time.time() - start_time))



[[375   3]
 [102  84]]
              precision    recall  f1-score   support

           0       0.79      0.99      0.88       378
           1       0.97      0.45      0.62       186

    accuracy                           0.81       564
   macro avg       0.88      0.72      0.75       564
weighted avg       0.85      0.81      0.79       564

0.8138297872340425
accuracy:  0.7373531914893617
f1:  0.49162946182890765
--- 10.476030826568604 seconds ---


In [108]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X, Y4, test_size=0.3, random_state=42)
start_time = time.time()
classifier = RandomForestClassifier(criterion='gini',max_depth=8,max_features='auto',n_estimators=1000, random_state=42)
classifier.fit(X1_train, y1_train) 

y1_pred = classifier.predict(X1_test)

print(confusion_matrix(y1_test,y1_pred))
print(classification_report(y1_test,y1_pred))
print(accuracy_score(y1_test, y1_pred))
rfc_cv_score = cross_validate(classifier, X, Y4, cv=5, scoring=['f1','accuracy'])
print('accuracy: ',rfc_cv_score['test_accuracy'].mean())
print('f1: ',rfc_cv_score['test_f1'].mean())
print("--- %s seconds ---" % (time.time() - start_time))



[[442   1]
 [107  14]]
              precision    recall  f1-score   support

           0       0.81      1.00      0.89       443
           1       0.93      0.12      0.21       121

    accuracy                           0.81       564
   macro avg       0.87      0.56      0.55       564
weighted avg       0.83      0.81      0.74       564

0.8085106382978723
accuracy:  0.8039460992907801
f1:  0.2731931152912991
--- 13.271032810211182 seconds ---


In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from tensorflow.keras.optimizers import RMSprop
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from tensorflow.keras.utils import to_categorical
from keras.callbacks import EarlyStopping
%matplotlib inline

In [24]:
df = tweets1.copy()
X = df['text_cleaned']
Y1 = df['f5_edpatient']
Y2 = df['f6_proed']
Y3 = df['f8_info']
Y4 = df['f9_scientific']

In [25]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X, Y1, test_size=0.3, random_state=42)
X2_train, X2_test, y2_train, y2_test = train_test_split(X, Y2, test_size=0.3, random_state=42)
X3_train, X3_test, y3_train, y3_test = train_test_split(X, Y3, test_size=0.3, random_state=42)
X4_train, X4_test, y4_train, y4_test = train_test_split(X, Y4, test_size=0.3, random_state=42)

In [26]:
max_words = 1000
max_len = 150
tok = Tokenizer(num_words=max_words)
tok.fit_on_texts(X1_train)
sequences = tok.texts_to_sequences(X1_train)
sequences_matrix = sequence.pad_sequences(sequences,maxlen=max_len)
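After tokenization each tweet is a list of integer word indices of varying length, and `pad_sequences` turns these lists into a fixed-width matrix. The pure-Python sketch below mimics its default behaviour (zeros added on the left, overlong sequences cut from the left); `pad_pre` is a hypothetical helper written for illustration and is not part of Keras.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length.

    Hypothetical helper mirroring the defaults of Keras `pad_sequences`
    (padding='pre', truncating='pre').
    """
    padded = []
    for seq in seqs:
        seq = list(seq)[-maxlen:]  # keep only the last `maxlen` tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


example = [[5, 2, 9], [7], [1, 2, 3, 4, 5, 6]]
print(pad_pre(example, maxlen=4))
```

With `max_len = 150`, most tweets end up mostly zeros on the left, and the real tokens sit at the end of each row.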

In [27]:
def RNN():
    inputs = Input(name='inputs',shape=[max_len])
    layer = Embedding(max_words,50,input_length=max_len)(inputs)
    layer = LSTM(100)(layer)
    layer = Dense(256,name='FC1')(layer)
    layer = Activation('relu')(layer)
    layer = Dropout(0.1)(layer)
    layer = Dense(1,name='out_layer')(layer)
    layer = Activation('sigmoid')(layer)
    model = Model(inputs=inputs,outputs=layer)
    return model

In [29]:
from keras.utils.vis_utils import plot_model

model = RNN()
model.summary()
#plot_model(model, show_shapes=True, show_layer_names=True)

Model: "model_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 inputs (InputLayer)         [(None, 150)]             0         
                                                                 
 embedding_1 (Embedding)     (None, 150, 50)           50000     
                                                                 
 lstm_1 (LSTM)               (None, 100)               60400     
                                                                 
 FC1 (Dense)                 (None, 256)               25856     
                                                                 
 activation_2 (Activation)   (None, 256)               0         
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 out_layer (Dense)           (None, 1)                 257 

In [30]:
import keras.backend as K

def f1_score(y_true, y_pred):

    # Count positive samples.
    c1 = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    c2 = K.sum(K.round(K.clip(y_pred, 0, 1)))
    c3 = K.sum(K.round(K.clip(y_true, 0, 1)))

    # If there are no true samples, fix the F1 score at 0.
    if c3 == 0.0:
        return 0.0

    # How many selected items are relevant?
    precision = c1 / c2

    # How many relevant items are selected?
    recall = c1 / c3

    # Calculate f1_score
    f1_score = 2 * (precision * recall) / (precision + recall)
    return f1_score

model = RNN()
model.summary()
model.compile(loss='binary_crossentropy',optimizer=RMSprop(),metrics=['accuracy',f1_score])

Model: "model_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 inputs (InputLayer)         [(None, 150)]             0         
                                                                 
 embedding_2 (Embedding)     (None, 150, 50)           50000     
                                                                 
 lstm_2 (LSTM)               (None, 100)               60400     
                                                                 
 FC1 (Dense)                 (None, 256)               25856     
                                                                 
 activation_4 (Activation)   (None, 256)               0         
                                                                 
 dropout_2 (Dropout)         (None, 256)               0         
                                                                 
 out_layer (Dense)           (None, 1)                 257 
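The custom `f1_score` metric above rounds the model outputs to 0 or 1 and computes F1 for class 1 on whatever tensors it receives. Because a function passed in `metrics` is evaluated per batch and the per-batch values are then averaged, the reported figure can differ from F1 computed once over all predictions. The NumPy sketch below illustrates this with made-up labels and probabilities; `batch_f1` is a hypothetical mirror of the metric, and its zero guard is widened (it returns 0 whenever there are no true positives) to avoid division by zero, which is a deliberate change from the cell above.

```python
import numpy as np


def batch_f1(y_true, y_prob):
    """NumPy mirror of the custom metric: threshold at 0.5, then F1 for class 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.round(np.clip(y_prob, 0, 1))
    tp = np.sum(y_true * y_hat)
    if tp == 0:  # widened guard: no true positives means F1 is 0
        return 0.0
    precision = tp / np.sum(y_hat)
    recall = tp / np.sum(y_true)
    return 2 * precision * recall / (precision + recall)


# Made-up labels and predicted probabilities (assumption).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.7, 0.1, 0.3, 0.6]

global_f1 = batch_f1(y_true, y_prob)
per_batch = [batch_f1(y_true[:4], y_prob[:4]), batch_f1(y_true[4:], y_prob[4:])]

print("F1 over all samples:", global_f1)
print("mean of per-batch F1:", np.mean(per_batch))
```

For a figure comparable with the Random Forest results, F1 can instead be computed once on the full set of test predictions.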

In [31]:
model.fit(sequences_matrix,y1_train,batch_size=64,epochs=20,
          validation_split=0.2,callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)])

Epoch 1/20
Epoch 2/20


<keras.callbacks.History at 0x1f687736888>

In [32]:
test_sequences = tok.texts_to_sequences(X1_test)
test_sequences_matrix = sequence.pad_sequences(test_sequences,maxlen=max_len)

In [33]:
accr = model.evaluate(test_sequences_matrix,y1_test)



In [34]:
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}\n f1: {:0.3f}'.format(accr[0],accr[1],accr[2]))
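# Threshold the sigmoid outputs at 0.5 so the saved predictions come from this model.
y1_pred = (model.predict(test_sequences_matrix) > 0.5).astype(int).ravel()
dft = pd.DataFrame({'Text':X1_test,'Actual':y1_test,'Predicted':y1_pred})
dft.to_csv('predictions/cate1-rnn-total-preds.csv', encoding='utf-8')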
dft[dft['Actual'] != dft['Predicted']].to_csv('predictions/cate1-rnn-total-wrong_preds.csv', encoding='utf-8')

Test set
  Loss: 0.445
  Accuracy: 0.801
 f1: 0.798


In [35]:
# CATEGORY 2
X1_train, X1_test, y1_train, y1_test = train_test_split(X, Y2, test_size=0.3, random_state=42)


max_words = 1000
max_len = 150
tok = Tokenizer(num_words=max_words)
tok.fit_on_texts(X1_train)
sequences = tok.texts_to_sequences(X1_train)
sequences_matrix = sequence.pad_sequences(sequences,maxlen=max_len)
model = RNN()
model.summary()
model.compile(loss='binary_crossentropy',optimizer=RMSprop(),metrics=['accuracy',f1_score])
model.fit(sequences_matrix,y1_train,batch_size=64,epochs=20,
          validation_split=0.2,callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)])
test_sequences = tok.texts_to_sequences(X1_test)
test_sequences_matrix = sequence.pad_sequences(test_sequences,maxlen=max_len)
accr = model.evaluate(test_sequences_matrix,y1_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}\n f1: {:0.3f}'.format(accr[0],accr[1],accr[2]))

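# Threshold the sigmoid outputs at 0.5 so the saved predictions come from this model.
y1_pred = (model.predict(test_sequences_matrix) > 0.5).astype(int).ravel()
dft = pd.DataFrame({'Text':X1_test,'Actual':y1_test,'Predicted':y1_pred})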
dft.to_csv('predictions/cate2-rnn-total-preds.csv', encoding='utf-8')
dft[dft['Actual'] != dft['Predicted']].to_csv('predictions/cate2-rnn-total-wrong_preds.csv', encoding='utf-8')

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 inputs (InputLayer)         [(None, 150)]             0         
                                                                 
 embedding_3 (Embedding)     (None, 150, 50)           50000     
                                                                 
 lstm_3 (LSTM)               (None, 100)               60400     
                                                                 
 FC1 (Dense)                 (None, 256)               25856     
                                                                 
 activation_6 (Activation)   (None, 256)               0         
                                                                 
 dropout_3 (Dropout)         (None, 256)               0         
                                                                 
 out_layer (Dense)           (None, 1)                 257 

In [36]:
# CATEGORY 3
X1_train, X1_test, y1_train, y1_test = train_test_split(X, Y3, test_size=0.3, random_state=42)


max_words = 1000
max_len = 150
tok = Tokenizer(num_words=max_words)
tok.fit_on_texts(X1_train)
sequences = tok.texts_to_sequences(X1_train)
sequences_matrix = sequence.pad_sequences(sequences,maxlen=max_len)
model = RNN()
model.summary()
model.compile(loss='binary_crossentropy',optimizer=RMSprop(),metrics=['accuracy',f1_score])
model.fit(sequences_matrix,y1_train,batch_size=64,epochs=20,
          validation_split=0.2,callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)])
test_sequences = tok.texts_to_sequences(X1_test)
test_sequences_matrix = sequence.pad_sequences(test_sequences,maxlen=max_len)
accr = model.evaluate(test_sequences_matrix,y1_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}\n f1: {:0.3f}'.format(accr[0],accr[1],accr[2]))
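# Threshold the sigmoid outputs at 0.5 so the confusion matrix reflects this model.
y1_pred = (model.predict(test_sequences_matrix) > 0.5).astype(int).ravel()
matrix = sklearn.metrics.confusion_matrix(y1_test, y1_pred)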
matrix

dft = pd.DataFrame({'Text':X1_test,'Actual':y1_test,'Predicted':y1_pred})
dft[dft['Actual'] != dft['Predicted']]
dft.to_csv('predictions/cate3-rnn-total-preds.csv', encoding='utf-8')
dft[dft['Actual'] != dft['Predicted']].to_csv('predictions/cate3-rnn-total-wrong_preds.csv', encoding='utf-8')

Model: "model_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 inputs (InputLayer)         [(None, 150)]             0         
                                                                 
 embedding_4 (Embedding)     (None, 150, 50)           50000     
                                                                 
 lstm_4 (LSTM)               (None, 100)               60400     
                                                                 
 FC1 (Dense)                 (None, 256)               25856     
                                                                 
 activation_8 (Activation)   (None, 256)               0         
                                                                 
 dropout_4 (Dropout)         (None, 256)               0         
                                                                 
 out_layer (Dense)           (None, 1)                 257 

In [37]:
# CATEGORY 4
X1_train, X1_test, y1_train, y1_test = train_test_split(X, Y4, test_size=0.3, random_state=42)


max_words = 1000
max_len = 150
tok = Tokenizer(num_words=max_words)
tok.fit_on_texts(X1_train)
sequences = tok.texts_to_sequences(X1_train)
sequences_matrix = sequence.pad_sequences(sequences,maxlen=max_len)
model = RNN()
model.summary()
model.compile(loss='binary_crossentropy',optimizer=RMSprop(),metrics=['accuracy',f1_score])
model.fit(sequences_matrix,y1_train,batch_size=64,epochs=20,
          validation_split=0.2,callbacks=[EarlyStopping(monitor='val_loss',min_delta=0.0001)])
test_sequences = tok.texts_to_sequences(X1_test)
test_sequences_matrix = sequence.pad_sequences(test_sequences,maxlen=max_len)
accr = model.evaluate(test_sequences_matrix,y1_test)
print('Test set\n  Loss: {:0.3f}\n  Accuracy: {:0.3f}\n f1: {:0.3f}'.format(accr[0],accr[1],accr[2]))
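# Threshold the sigmoid outputs at 0.5 so the confusion matrix reflects this model.
y1_pred = (model.predict(test_sequences_matrix) > 0.5).astype(int).ravel()
matrix = sklearn.metrics.confusion_matrix(y1_test, y1_pred)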
matrix

dft = pd.DataFrame({'Text':X1_test,'Actual':y1_test,'Predicted':y1_pred})
dft[dft['Actual'] != dft['Predicted']]

dft.to_csv('predictions/cate4-rnn-total-preds.csv', encoding='utf-8')
dft[dft['Actual'] != dft['Predicted']].to_csv('predictions/cate4-rnn-total-wrong_preds.csv', encoding='utf-8')

Model: "model_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 inputs (InputLayer)         [(None, 150)]             0         
                                                                 
 embedding_5 (Embedding)     (None, 150, 50)           50000     
                                                                 
 lstm_5 (LSTM)               (None, 100)               60400     
                                                                 
 FC1 (Dense)                 (None, 256)               25856     
                                                                 
 activation_10 (Activation)  (None, 256)               0         
                                                                 
 dropout_5 (Dropout)         (None, 256)               0         
                                                                 
 out_layer (Dense)           (None, 1)                 257 

'rt learn more about anorexia and bulimia as well as other eating disorders here'

In [39]:
import tensorflow as tf

In [40]:
# CATEGORY 1
X1_train, X1_test, y1_train, y1_test = train_test_split(X, Y1, test_size=0.3, random_state=42)

VOCAB_SIZE=1000
encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=VOCAB_SIZE)
encoder.adapt(np.asarray(X1_train))

model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(2e-4),
              metrics=['accuracy',f1_score])

history = model.fit(X1_train,y1_train, epochs=14,
                    validation_data=(X1_test,y1_test), 
                    validation_steps=30)

test_loss, test_acc, test_f1 = model.evaluate(X1_test,y1_test)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
print('Test f1: {}'.format(test_f1))


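# The model outputs logits, so a cutoff of 0 corresponds to a probability of 0.5.
y1_pred = (model.predict(X1_test) > 0).astype(int).ravel()
matrix = sklearn.metrics.confusion_matrix(y1_test, y1_pred)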
matrix

dft = pd.DataFrame({'Text':X1_test,'Actual':y1_test,'Predicted':y1_pred})
dft[dft['Actual'] != dft['Predicted']]

dft.to_csv('predictions/cate1-bilstm-total-preds.csv', encoding='utf-8')
dft[dft['Actual'] != dft['Predicted']].to_csv('predictions/cate1-bilstm-total-wrong_preds.csv', encoding='utf-8')

Epoch 1/14
Epoch 2/14
Epoch 3/14
Epoch 4/14
Epoch 5/14
Epoch 6/14
Epoch 7/14
Epoch 8/14
Epoch 9/14
Epoch 10/14
Epoch 11/14
Epoch 12/14
Epoch 13/14
Epoch 14/14
Test Loss: 0.8800280690193176
Test Accuracy: 0.7872340679168701
Test f1: 0.7806693911552429
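The Bi-LSTM ends in `Dense(1)` without an activation and is trained with `from_logits=True`, so its raw outputs are logits rather than probabilities. The sketch below, using made-up logit values, shows that cutting logits at 0 matches cutting probabilities at 0.5, while rounding the raw outputs (as the custom `f1_score` metric does after clipping) amounts to a logit cutoff of 0.5, i.e. a probability of about 0.62.

```python
import math


def sigmoid(z):
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Made-up logits for illustration (assumption).
logits = [-2.0, -0.1, 0.0, 0.3, 0.5, 1.7]
probs = [sigmoid(z) for z in logits]

labels_from_probs = [int(p > 0.5) for p in probs]       # cutoff on probabilities
labels_from_logits = [int(z > 0.0) for z in logits]     # equivalent cutoff on logits
labels_logit_cut_half = [int(z > 0.5) for z in logits]  # stricter cutoff on raw outputs

print(labels_from_probs)
print(labels_from_logits)
print(labels_logit_cut_half)
print(round(sigmoid(0.5), 2))
```

Metrics that threshold raw outputs at 0.5 therefore count fewer positives for this model than a probability cutoff of 0.5 would.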


In [41]:
# CATEGORY 2
X1_train, X1_test, y1_train, y1_test = train_test_split(X, Y2, test_size=0.3, random_state=42)

VOCAB_SIZE=1000
encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=VOCAB_SIZE)
encoder.adapt(np.asarray(X1_train))

model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(2e-4),
              metrics=['accuracy',f1_score])

history = model.fit(X1_train,y1_train, epochs=14,
                    validation_data=(X1_test,y1_test), 
                    validation_steps=30)

test_loss, test_acc, test_f1 = model.evaluate(X1_test,y1_test)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
print('Test f1: {}'.format(test_f1))

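# The model outputs logits, so a cutoff of 0 corresponds to a probability of 0.5.
y1_pred = (model.predict(X1_test) > 0).astype(int).ravel()
matrix = sklearn.metrics.confusion_matrix(y1_test, y1_pred)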
matrix

dft = pd.DataFrame({'Text':X1_test,'Actual':y1_test,'Predicted':y1_pred})
dft[dft['Actual'] != dft['Predicted']]

dft.to_csv('predictions/cate2-bilstm-total-preds.csv', encoding='utf-8')
dft[dft['Actual'] != dft['Predicted']].to_csv('predictions/cate2-bilstm-total-wrong_preds.csv', encoding='utf-8')

Epoch 1/14
Epoch 2/14
Epoch 3/14
Epoch 4/14
Epoch 5/14
Epoch 6/14
Epoch 7/14
Epoch 8/14
Epoch 9/14
Epoch 10/14
Epoch 11/14
Epoch 12/14
Epoch 13/14
Epoch 14/14
Test Loss: 0.5378828048706055
Test Accuracy: 0.859929084777832
Test f1: 0.6714297533035278


In [42]:
# CATEGORY 3
X1_train, X1_test, y1_train, y1_test = train_test_split(X, Y3, test_size=0.3, random_state=42)

VOCAB_SIZE=1000
encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=VOCAB_SIZE)
encoder.adapt(np.asarray(X1_train))

model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(2e-4),
              metrics=['accuracy',f1_score])

history = model.fit(X1_train,y1_train, epochs=14,
                    validation_data=(X1_test,y1_test), 
                    validation_steps=30)

test_loss, test_acc, test_f1 = model.evaluate(X1_test,y1_test)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
print('Test f1: {}'.format(test_f1))

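# The model outputs logits, so a cutoff of 0 corresponds to a probability of 0.5.
y1_pred = (model.predict(X1_test) > 0).astype(int).ravel()
matrix = sklearn.metrics.confusion_matrix(y1_test, y1_pred)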
matrix

dft = pd.DataFrame({'Text':X1_test,'Actual':y1_test,'Predicted':y1_pred})
dft[dft['Actual'] != dft['Predicted']]

dft.to_csv('predictions/cate3-bilstm-total-preds.csv', encoding='utf-8')
dft[dft['Actual'] != dft['Predicted']].to_csv('predictions/cate3-bilstm-total-wrong_preds.csv', encoding='utf-8')

Epoch 1/14
Epoch 2/14
Epoch 3/14
Epoch 4/14
Epoch 5/14
Epoch 6/14
Epoch 7/14
Epoch 8/14
Epoch 9/14
Epoch 10/14
Epoch 11/14
Epoch 12/14
Epoch 13/14
Epoch 14/14
Test Loss: 0.9884968400001526
Test Accuracy: 0.7907801270484924
Test f1: 0.6851006150245667


In [None]:
# CATEGORY 4
X1_train, X1_test, y1_train, y1_test = train_test_split(X, Y4, test_size=0.3, random_state=42)

VOCAB_SIZE=1000
encoder = tf.keras.layers.experimental.preprocessing.TextVectorization(
    max_tokens=VOCAB_SIZE)
encoder.adapt(np.asarray(X1_train))

model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(2e-4),
              metrics=['accuracy',f1_score])

history = model.fit(X1_train,y1_train, epochs=14,
                    validation_data=(X1_test,y1_test), 
                    validation_steps=30)

test_loss, test_acc, test_f1 = model.evaluate(X1_test,y1_test)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
print('Test f1: {}'.format(test_f1))

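# The model outputs logits, so a cutoff of 0 corresponds to a probability of 0.5.
y1_pred = (model.predict(X1_test) > 0).astype(int).ravel()
matrix = sklearn.metrics.confusion_matrix(y1_test, y1_pred)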
matrix
dft = pd.DataFrame({'Text':X1_test,'Actual':y1_test,'Predicted':y1_pred})

dft.to_csv('predictions/cate4-bilstm-total-preds.csv', encoding='utf-8')
dft[dft['Actual'] != dft['Predicted']].to_csv('predictions/cate4-bilstm-total-wrong_preds.csv', encoding='utf-8')

Epoch 1/14
Epoch 2/14
Epoch 3/14
Epoch 4/14
Epoch 5/14
Epoch 6/14
Epoch 7/14
Epoch 8/14
Epoch 9/14
Epoch 10/14
Epoch 11/14
Epoch 12/14
Epoch 13/14
Epoch 14/14

In [91]:
dft = pd.DataFrame({'Text':X1_test,'Actual':y1_test,'Predicted':y1_pred})
dft[dft['Actual'] != dft['Predicted']]

Unnamed: 0,Text,Actual,Predicted
1111,developed anorexia years ago even knew edtwt risas,0,1
1448,height bulimia vs recovery therapy learning total food freedom,0,1
1326,tw eating disorders i’m recovery anymore honestly don’t know feel worse relapsing ever thinking could possibly recover pray nobody know irl notices,0,1
270,"tw ed anorexia uhm irl said ""i went ed worst"" ""bsf"" literally responded ""thank fuck dont issues"" fucking insensitive",0,1
865,yall bully dont go eat icecream😀😀💔,0,1
...,...,...,...
1334,european eating disorders review covid19 issue open access,1,0
893,need strict pro ana coach dm pls,0,1
1733,researchbased interventions eating disorders paraphilias,1,0
236,isnt funny love starve second family screams dinner binge,0,1
