# Tweets Impact Analysis with Natural Language Processing



## Importing packages

In [60]:
import pandas as pd
import numpy as np

## Importing datasets

In [61]:
df = pd.read_csv("_dataset/Mercadolibre_tweets.csv")
df["Max_Rt_Fav"] = df[['favorite_count','retweet_count']].max(axis=1)

## Preprocessing and Tokens

Here we clean the text column by replacing every character that is not an ASCII letter, a digit, or a space with a space. Then we create a new column holding the tokens of each tweet, that is, we split each sentence into a list of independent words.

In [62]:
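# regex=True is required in recent pandas versions, where str.replace treats the pattern literally by default
df["text"] = df["text"].str.replace("[^0-9a-zA-Z ]+", " ", regex=True)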
df["text"] = df["text"].str.lower()
df["tokenized"] = df["text"].str.split()

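# Drop the columns we no longer need (the positional axis argument is not accepted by recent pandas versions)
df = df.drop(columns=["id", "created_at", "favorite_count", "retweet_count"])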


In [63]:
df.head(5)

Unnamed: 0,text,Max_Rt_Fav,tokenized
0,en 2020 incorporamos un promedio de 40 emplead...,1838,"[en, 2020, incorporamos, un, promedio, de, 40,..."
1,salvaje usa nuestra caja para dormir la siesta...,886,"[salvaje, usa, nuestra, caja, para, dormir, la..."
2,d a a d a m s de 80 mil pymes argentinas vend...,936,"[d, a, a, d, a, m, s, de, 80, mil, pymes, arge..."
3,sab as que m s de la mitad de las compras se ...,239,"[sab, as, que, m, s, de, la, mitad, de, las, c..."
4,conoc la historia de nouvelle factory una py...,59,"[conoc, la, historia, de, nouvelle, factory, u..."
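
The sample rows above show a side effect of the ASCII-only pattern: accented letters are replaced by spaces, so words are split into fragments such as `sab`, `as`, `m`, `s`. One possible remedy, sketched below under the assumption that dropping the accent while keeping the base letter is acceptable, is to strip combining marks with the standard-library `unicodedata` module before applying the same regular expression. The helper name `normalize_tweet` and the sample sentence are illustrative and not part of the original notebook.

```python
import re
import unicodedata


def normalize_tweet(text: str) -> str:
    """Hypothetical helper: remove accents, then apply the notebook's cleaning pattern."""
    # NFKD splits an accented letter into its base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, so "í" becomes "i" and "ñ" becomes "n"
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Same pattern as in the notebook: anything that is not a letter, digit, or space becomes a space
    cleaned = re.sub(r"[^0-9a-zA-Z ]+", " ", without_marks)
    return cleaned.lower()


# Made-up sample sentence, used only to illustrate the behavior
sample_tokens = normalize_tweet("Día a día, más de 80 mil pymes").split()
print(sample_tokens)
```

In the notebook this could be applied with `df["text"].apply(normalize_tweet)` in place of the two `str` calls, if that trade-off suits the data.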


Now we create a large matrix with one row per tweet and one column per token that appears more than once, initialized with zeros. Later we will probably keep only the tokens that occur at least X times and at most Y times, to limit underfitting and overfitting.

In [64]:
tokens_no_repeated = set()
tokens_repeated = set()

def clasify_tokens(vector):
    for token in vector:
        if token in tokens_no_repeated:
            tokens_repeated.add(token)
        else:
            tokens_no_repeated.add(token)

In [65]:
tokens_analyzed = df["tokenized"].apply(clasify_tokens)

In [66]:
print(f"There are {len(tokens_no_repeated)} distinct tokens.")
print(f"There are {len(tokens_repeated)} tokens that appear more than once.")
print(f"Only {len(tokens_repeated)/(len(tokens_no_repeated))*100:.2f}% of the distinct tokens appear more than once.")

There are 7555 distinct tokens.
There are 2800 tokens that appear more than once.
Only 37.06% of the distinct tokens appear more than once.
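
The same bookkeeping can also be done with `collections.Counter`, which keeps the actual number of occurrences per token in addition to the two sets. This is a minimal sketch on made-up token lists; the variable names and data are illustrative only.

```python
from collections import Counter

# Made-up tokenized tweets standing in for df["tokenized"]
toy_tokenized = [
    ["envio", "gratis", "hoy"],
    ["envio", "rapido"],
    ["hoy", "hoy"],
]

# Count every occurrence of every token across all tweets
token_counts = Counter(token for tokens in toy_tokenized for token in tokens)

# All distinct tokens, and the subset seen more than once
distinct_tokens = set(token_counts)
repeated_tokens = {token for token, n in token_counts.items() if n > 1}

print(len(distinct_tokens), "distinct tokens")
print(len(repeated_tokens), "tokens seen more than once")
```

As in the notebook's function, a token repeated inside a single tweet counts as repeated.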


In [67]:
# Convert the set to a list: recent pandas versions reject a set as column labels
counts = pd.DataFrame(0, index = np.arange(len(df["tokenized"])),
                      columns=list(tokens_repeated))

In [68]:
counts.head()

Unnamed: 0,jueves,colecci,tica,ocurrido,f,luisperezurra,lejos,mochila,tal,facebook,...,continua,play,obligatorio,a5ianksp,trayectoria,biles,cula,chat,mo,profesional
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now we need to fill in each cell:

In [69]:
for i, tokens in enumerate(df["tokenized"]):
    for token in tokens:
        if token in tokens_repeated:
            counts.loc[i,token] += 1
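
Updating one cell at a time with `counts.loc` is easy to follow but can be slow for thousands of tweets. A possible alternative, sketched here on toy data, is to count the tokens of each tweet with `collections.Counter` and let pandas assemble the matrix in one step. The names `toy_tokenized`, `vocabulary`, and `counts_toy` are hypothetical stand-ins for `df["tokenized"]`, `tokens_repeated`, and `counts`.

```python
from collections import Counter

import pandas as pd

# Made-up tokenized tweets; the last one is empty on purpose
toy_tokenized = pd.Series([["envio", "gratis", "envio"], ["gratis", "hoy"], []])
# Stand-in for the tokens we want as columns
vocabulary = ["envio", "gratis"]

# One Counter per tweet: token -> number of occurrences in that tweet
per_tweet = [Counter(tokens) for tokens in toy_tokenized]

counts_toy = (
    pd.DataFrame(per_tweet, index=toy_tokenized.index)
    .reindex(columns=vocabulary)  # keep only the vocabulary, in a fixed order
    .fillna(0)                    # tokens absent from a tweet count as zero
    .astype(int)
)
print(counts_toy)
```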

We keep only the tokens whose total count across all tweets is between 5 and 100 (inclusive).

In [70]:
word_counts = counts.sum()
counts = counts[word_counts[(word_counts <= 100) & (word_counts >= 5)].index]

In [71]:
len(counts)

2717

## Model

#### **(Currently working on it)**

In [72]:
#val = counts.iloc[2716]
#counts = counts.iloc[0:2716]

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(counts, df["Max_Rt_Fav"], test_size=0.2, random_state=1)

In [15]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

In [16]:
print("Train score:", clf.score(X_train, y_train))
print("Test score:", clf.score(X_test, y_test))  # train accuracy well above test accuracy: a sign of overfitting

Train score: 0.9585826046939715
Test score: 0.5588235294117647


Because the train score is much higher than the test score, the model shows signs of overfitting.

In [17]:
clf = DecisionTreeClassifier(max_depth=3)

clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

In [18]:
print("Train score:", clf.score(X_train, y_train)) 
print("Test score:", clf.score(X_test, y_test)) 

Train score: 0.6148182236539347
Test score: 0.6433823529411765


Great! With the maximum tree depth fixed at 3, the model is less complex and no longer overfits the data: the train score dropped to 0.61 while the test score rose to 0.64, so the test score is now slightly higher than the train score.
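
To see how the gap between train and test scores changes with tree depth, one option is to sweep a few values of `max_depth`. The sketch below uses synthetic data from scikit-learn's `make_classification`, not the tweet data, so the numbers it prints say nothing about this project; the depth values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: any labelled dataset would do here)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

scores = {}
for depth in (1, 3, 5, None):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    tree.fit(X_tr, y_tr)
    # score() returns mean accuracy for classifiers
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(f"max_depth={depth}: train={scores[depth][0]:.3f}, test={scores[depth][1]:.3f}")
```

A large difference between the two columns for deep trees, shrinking as the depth is reduced, is the pattern described above.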

In [19]:
mse = ((y_test - predictions)**2).sum()/len(predictions)
print(mse)

452614.67463235295


In [20]:
mean_upvotes = df["Max_Rt_Fav"].mean()
std_upvotes = df["Max_Rt_Fav"].std()

In [21]:
print("Mean fav/rts: {:.2f}".format(mean_upvotes))
print("Standard deviation of fav/rts: {:.2f}".format(std_upvotes))
print("The root mean square error is: {:.2f}".format(mse**0.5))

Mean fav/rts: 82.79
Standard deviation of fav/rts: 1824.09
The root mean square error is: 672.77


As we can see, the mean of the favs/RTs is 82.79 and the standard deviation is 1824.09. The square root of the MSE is 672.77, so a typical prediction on the test set is off by roughly 670 favs/RTs (the RMSE weights large errors more heavily than a plain average would). This is a fairly high value, about eight times the mean, but it is acceptable for the purposes of this project. Keep in mind that we are working with a very small dataset and are not using transfer learning.
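
One way to put an RMSE into context is to compare it with a baseline that always predicts the mean of the target. The following sketch uses made-up numbers, not the notebook's data, and shows how a single large miss can dominate the metric.

```python
import numpy as np

# Made-up targets and predictions; the last tweet is a "viral" outlier
y_true = np.array([10, 0, 5, 200])
y_pred = np.array([8, 1, 5, 20])

# Mean squared error and its square root, as computed in the notebook
mse = np.mean((y_true - y_pred) ** 2)
rmse = np.sqrt(mse)

# Baseline: predict the mean of the targets for every tweet
baseline_pred = np.full(y_true.shape, y_true.mean())
baseline_rmse = np.sqrt(np.mean((y_true - baseline_pred) ** 2))

print(f"model RMSE:    {rmse:.2f}")
print(f"baseline RMSE: {baseline_rmse:.2f}")
```

In this toy case three of the four predictions are nearly exact, yet the model's RMSE is worse than the baseline's because of the one outlier.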




In [22]:
h = clf.predict(counts.iloc[4:100])

In [73]:
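# Add a trailing axis so the array has shape (samples, timesteps, features) = (2717, 1009, 1)
counts = counts.values.reshape((counts.shape[0], counts.shape[1], 1))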

## DNN Model 
#### **(Currently working on it)**

Using Keras (bundled with TensorFlow), we train a neural network made of an LSTM layer followed by a single-unit dense output layer.
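
Keras recurrent layers expect three-dimensional input of shape `(samples, timesteps, features)`, which is why the count matrix was reshaped earlier: each token column becomes one timestep with a single feature. The sketch below shows that reshape on a small made-up array; the sizes are arbitrary.

```python
import numpy as np

n_samples, n_tokens = 4, 6  # toy sizes, standing in for 2717 tweets and 1009 tokens

# Made-up 2D count matrix: one row per tweet, one column per token
counts_2d = np.arange(n_samples * n_tokens).reshape(n_samples, n_tokens)

# Add a trailing feature axis: (samples, timesteps, features)
counts_3d = counts_2d.reshape(n_samples, n_tokens, 1)

print(counts_2d.shape)
print(counts_3d.shape)
```

Note that with this layout the order of the token columns becomes the order of the sequence fed to the LSTM.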



In [92]:
# Import everything from tensorflow.keras to avoid mixing two Keras installations
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.models import Sequential



X_train, X_test, y_train, y_test = train_test_split(counts, df["Max_Rt_Fav"], test_size=0.2, random_state=1)

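lstm_units = 8                # size of the LSTM hidden state
n_timesteps = counts.shape[1] # 1009 token columns, each treated as one timestep
epochs = 5

model = Sequential()
model.add(LSTM(lstm_units, input_shape=(n_timesteps, 1)))
model.add(Dense(1))           # single output: predicted fav/RT count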

model.compile(optimizer="adam", loss="mean_squared_error")

history = model.fit(X_train, y_train, batch_size=16, epochs=epochs, validation_data=(X_test,y_test))

y_pred = model.predict(X_test)

Train on 2173 samples, validate on 544 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
