## Data understanding

Besides the label and the tweet text, the dataset contains a tweet ID, the date, the search query, and the Twitter user name; these columns are not used further. The goal is to build a sentiment classifier from pre-labeled tweets, each classified as `negative` (= 0), `neutral` (= 2), or `positive` (= 4).

### Data import

In [26]:
import pandas as pd

encoding_type = "latin-1" # required for text encoding

df_train = pd.read_csv("case_company perception_data/case4_train.csv", header=None, encoding=encoding_type)
df_test = pd.read_csv("case_company perception_data/case4_test.csv", header=None)
df_test.head()

| | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 4 | 3 | Mon May 11 03:17:40 UTC 2009 | kindle2 | tpryan | @stellargirl I loooooooovvvvvveee my Kindle2. ... |
| 1 | 4 | 4 | Mon May 11 03:18:03 UTC 2009 | kindle2 | vcu451 | Reading my kindle2... Love it... Lee childs i... |
| 2 | 4 | 5 | Mon May 11 03:18:54 UTC 2009 | kindle2 | chadfu | Ok, first assesment of the #kindle2 ...it fuck... |
| 3 | 4 | 6 | Mon May 11 03:19:04 UTC 2009 | kindle2 | SIX15 | @kenburbary You'll love your Kindle2. I've had... |
| 4 | 4 | 7 | Mon May 11 03:21:41 UTC 2009 | kindle2 | yamarama | @mikefish Fair enough. But i have the Kindle2... |


In [27]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   0       1600000 non-null  int64 
 1   1       1600000 non-null  int64 
 2   2       1600000 non-null  object
 3   3       1600000 non-null  object
 4   4       1600000 non-null  object
 5   5       1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


The imported data has no column labels. Only columns 0 and 5 are of interest for training the model: they contain the sentiment label and the associated tweet text. These two columns are therefore renamed to `labels` and `text`.

In [28]:
df_train = df_train.rename(columns={0: "labels", 5: "text"})
df_test = df_test.rename(columns={0: "labels", 5: "text"})

df_test.head()

| | labels | 1 | 2 | 3 | 4 | text |
|---|---|---|---|---|---|---|
| 0 | 4 | 3 | Mon May 11 03:17:40 UTC 2009 | kindle2 | tpryan | @stellargirl I loooooooovvvvvveee my Kindle2. ... |
| 1 | 4 | 4 | Mon May 11 03:18:03 UTC 2009 | kindle2 | vcu451 | Reading my kindle2... Love it... Lee childs i... |
| 2 | 4 | 5 | Mon May 11 03:18:54 UTC 2009 | kindle2 | chadfu | Ok, first assesment of the #kindle2 ...it fuck... |
| 3 | 4 | 6 | Mon May 11 03:19:04 UTC 2009 | kindle2 | SIX15 | @kenburbary You'll love your Kindle2. I've had... |
| 4 | 4 | 7 | Mon May 11 03:21:41 UTC 2009 | kindle2 | yamarama | @mikefish Fair enough. But i have the Kindle2... |
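
Since only two of the six columns are needed, the selection and naming could also happen while reading the file. The following is a minimal sketch on a tiny in-memory CSV; the two rows are made up, and the names `id`, `date`, `query`, and `user` are our own placeholder guesses for the unused columns, not labels taken from the case files.

```python
import io
import pandas as pd

# Two made-up rows in the same six-column layout as the case files
raw = io.StringIO(
    '0,1,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,user_a,"bad day, really"\n'
    "4,2,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,user_b,great day\n"
)

# Placeholder names for all six columns (only "labels" and "text" matter here)
column_names = ["labels", "id", "date", "query", "user", "text"]

# Name the columns and load only the label and the tweet text
df_small = pd.read_csv(raw, header=None, names=column_names, usecols=["labels", "text"])
print(df_small)
```

Loading only the required columns also reduces memory use, which can matter for a file with 1.6 million rows.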


With 1,600,000 tweets, the full `df_train` dataset is larger than needed for this exercise. To reduce computation time, only the first 5,000 observations classified as `negative` (= 0) and the first 5,000 classified as `positive` (= 4) are used.

In [29]:
mask_0 = df_train["labels"] == 0
mask_4 = df_train["labels"] == 4

# Use boolean indexing to filter the data frame
filtered_df_0 = df_train[mask_0].head(5000)
filtered_df_4 = df_train[mask_4].head(5000)

df_train = pd.concat([filtered_df_0, filtered_df_4], ignore_index=True)
df_train.head()

| | labels | 1 | 2 | 3 | 4 | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |
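
Taking the first rows of each class keeps tweets in file order, and the rows shown above were all posted within a few seconds of each other. If a random, balanced subset is preferred, one option is to sample per label group. This is a sketch on a toy DataFrame standing in for `df_train`; `n_per_class` and the seed are illustrative choices.

```python
import pandas as pd

# Toy stand-in for df_train: 6 negative (0) and 6 positive (4) rows
df_toy = pd.DataFrame({
    "labels": [0] * 6 + [4] * 6,
    "text": [f"tweet {i}" for i in range(12)],
})

n_per_class = 3  # the notebook uses 5,000 per class

# Draw the same number of rows at random from each label group
df_sample = (
    df_toy.groupby("labels")
    .sample(n=n_per_class, random_state=42)
    .reset_index(drop=True)
)
print(df_sample["labels"].value_counts())
```

Fixing `random_state` keeps the subset reproducible across runs.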


`df_test` contains some observations that are classified as `neutral`, while `df_train` contains only observations that are classified as `positive` or `negative`. To keep things simple, we will focus only on these two labels and drop the `neutral` observations from the `df_test` dataset.

In [30]:
df_test["labels"].unique()

array([4, 0, 2])

In [31]:
df_test = df_test[df_test["labels"] != 2].reset_index(drop=True)
df_test["labels"].unique()

array([4, 0])

## Modeling and evaluation I

In [32]:
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

import spacy
nlp = spacy.load("en_core_web_md") # load english model

In [33]:
# Embed each training tweet; doc.vector is the average of the token vectors
help_text_train_list = [doc.vector for doc in nlp.pipe(df_train["text"])]

In [34]:
# Embed each test tweet in the same way
help_text_test_list = [doc.vector for doc in nlp.pipe(df_test["text"])]

We store the embeddings in a DataFrame, which makes it easy to pass them to the model. Each row of the new DataFrame holds the 300-dimensional embedding of a single tweet.

In [35]:
X_train = pd.DataFrame(help_text_train_list)
X_test = pd.DataFrame(help_text_test_list)

X_train.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.150343 | 1.982271 | -2.532449 | -0.122276 | 2.759819 | 0.937225 | 0.976607 | 2.90537 | -0.29508 | -0.396499 | ... | 0.339731 | 0.541458 | -0.103209 | -0.108209 | -2.091317 | 0.751077 | 0.727215 | 1.228262 | -1.930907 | 1.186035 |
| 1 | -0.13812 | 1.720601 | -2.325878 | -1.046416 | 1.455242 | 0.16279 | 0.733155 | 3.236209 | -0.717819 | -1.421322 | ... | 1.035674 | -0.963529 | 1.211805 | -0.954708 | -1.14083 | 0.726342 | 1.421039 | -0.45664 | -3.134573 | 1.992079 |
| 2 | -2.365839 | 0.494969 | -2.006724 | 1.103424 | 2.779244 | 1.496742 | 0.609426 | 4.192228 | -1.191918 | -0.154026 | ... | -0.365396 | 1.027518 | -0.089048 | 0.963603 | -0.949952 | 2.96361 | -0.965779 | -1.571609 | -2.333352 | 0.872322 |
| 3 | -0.317766 | 0.726643 | -4.786879 | -0.841273 | 4.377829 | -0.636914 | -0.691929 | 5.463463 | 0.053415 | 0.83154 | ... | -0.692915 | -0.431903 | 1.351665 | -1.402728 | -2.742922 | -0.66138 | 1.397619 | 0.050472 | -5.147802 | -0.659592 |
| 4 | 1.086641 | 1.551972 | -3.302518 | -2.613626 | -1.569944 | 1.95115 | -0.059932 | 4.777886 | -3.323206 | 2.323077 | ... | 2.161126 | 0.397071 | 4.017306 | -3.820434 | -1.991124 | 0.048167 | 0.402811 | 2.411552 | -3.352856 | 2.692743 |
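
In spaCy, `Doc.vector` defaults to the average of the token vectors, so each row above is a mean over the word vectors of one tweet. The averaging itself can be sketched in NumPy as follows. The 3-dimensional vectors are made up for readability (the real ones have 300 dimensions), and treating unknown tokens as zero vectors is an assumption of this sketch.

```python
import numpy as np

# Made-up word vectors (3 dimensions instead of 300)
toy_vectors = {
    "love": np.array([1.0, 0.0, 2.0]),
    "my": np.array([0.0, 1.0, 0.0]),
    "kindle": np.array([2.0, 2.0, 1.0]),
}

def embed(tokens, vectors, dim=3):
    """Average the vectors of all tokens; unknown tokens count as zero vectors."""
    rows = [vectors.get(tok, np.zeros(dim)) for tok in tokens]
    return np.mean(rows, axis=0)

tweet_vector = embed(["love", "my", "kindle"], toy_vectors)
print(tweet_vector)
```

Because the result is an average, word order is lost: two tweets with the same words in a different order get the same embedding.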


We use scikit-learn's `LabelEncoder` to map the original labels 0 (`negative`) and 4 (`positive`) to the integers 0 and 1.

In [36]:
# Preprocess the label data

label_encoder = LabelEncoder()

y_train = label_encoder.fit_transform(df_train["labels"])  # learn the mapping 0 -> 0, 4 -> 1
y_test = label_encoder.transform(df_test["labels"])  # reuse the mapping learned on the training labels

y_train

array([0, 0, 0, ..., 1, 1, 1])
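
The mapping learned by the encoder can be inspected through its `classes_` attribute. A short sketch with a made-up label list:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit([0, 4, 4, 0])  # learn the sorted set of distinct labels

print(encoder.classes_)                   # position in this array = encoded value
print(encoder.transform([4, 0]))          # apply the fitted mapping to new labels
print(encoder.inverse_transform([1, 0]))  # map encoded values back to the originals
```

`inverse_transform` is useful later for translating model predictions back into the original label scheme.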

We now create our sequential neural network model:

In [37]:
# Create a Sequential model
model = Sequential()

# Add a dense layer with 64 neurons, ReLU activation function, and input dimension based on X_train
model.add(Dense(64, activation="relu", input_dim=X_train.shape[1]))

# Add another dense layer with 1 neuron and sigmoid activation function for binary classification
model.add(Dense(1, activation="sigmoid"))

In [38]:
# Compile the model

# Use the Adam optimizer, binary crossentropy loss (common for binary classification), and accuracy as a metric
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
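
To make the architecture concrete, the forward pass of this two-layer network can be written out by hand. The sketch below uses NumPy with random placeholder weights (not the weights Keras learns) and a random input batch, so the resulting probabilities are meaningless; it only illustrates the shapes and the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 300, 64  # 300 embedding dimensions, 64 hidden neurons

# Random placeholder parameters; Keras learns these during training
W1 = rng.normal(size=(n_features, n_hidden)) * 0.05
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, 1)) * 0.05
b2 = np.zeros(1)

def forward(X):
    hidden = np.maximum(0.0, X @ W1 + b1)              # Dense(64, relu)
    return 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))   # Dense(1, sigmoid)

X_batch = rng.normal(size=(5, n_features))  # five made-up "tweets"
probs = forward(X_batch)

n_params = W1.size + b1.size + W2.size + b2.size
print(probs.shape, n_params)
```

The parameter count is (300 × 64 + 64) + (64 × 1 + 1) = 19,329, which should match what `model.summary()` reports for 300-dimensional inputs.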

In [39]:
# Train the model

# X_train and y_train are the training data and labels
# Use 10 epochs, a batch size of 32, and allocate 20% of the data for validation during training
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x14a768c40>

In [40]:
# Evaluate the model

accuracy = model.evaluate(X_test, y_test)[1]
print(f"Test Accuracy: {accuracy}")

Test Accuracy: 0.6963788270950317
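
Accuracy alone hides how the errors are distributed over the two classes. The following sketch shows how confusion-matrix counts, precision, and recall could be derived from thresholded predictions. The label and probability arrays are made up and are not results of the model above.

```python
import numpy as np

# Made-up ground truth and predicted probabilities for eight tweets
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.6, 0.3, 0.2, 0.9, 0.4, 0.8, 0.7])

y_pred = (y_prob >= 0.5).astype(int)  # threshold the sigmoid output at 0.5

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # positive, predicted positive
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # negative, predicted negative
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # negative, predicted positive
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # positive, predicted negative

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, tn, fp, fn, accuracy, precision, recall)
```

With the real model, `y_prob` would come from `model.predict(X_test).ravel()` and `y_true` from `y_test`.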


## Prediction

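Finally, the model can be used to classify new utterances. It outputs a value between 0 and 1, which is the estimated probability of the `positive` class (1); values close to 0 indicate `negative` (0).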

In [41]:
prediction = nlp("EEBDA is such a great course!")
model.predict(pd.DataFrame([prediction.vector]))



array([[0.99736106]], dtype=float32)
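
To turn this output into a class, the probability can be compared against a threshold. A minimal sketch, assuming the common cutoff of 0.5; `to_label` is a hypothetical helper, not part of Keras.

```python
def to_label(probability, threshold=0.5):
    """Map a sigmoid output to the encoded class and a readable name."""
    encoded = int(probability >= threshold)
    return encoded, ("positive" if encoded == 1 else "negative")

print(to_label(0.99736106))  # probability returned by the prediction above
```

A different threshold may be appropriate if false positives and false negatives have different costs.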