#### Problem statement

Predict the political party from the tweet text and the handle

#### Data description
This dataset has three columns - label (party name), twitter handle, tweet text


#### Problem Description:

Design a feed forward deep neural network to predict the political party using the pytorch or tensorflow. 
Build two models

1. Without using the handle

2. Using the handle


#### Deliverables

- Report the performance on the test set.

- Try multiple models and with different hyperparameters. Present the results of each model on the test set. No need to create a dev set.

- Experiment with:
    -L2 and dropout regularization techniques
    -SGD, RMSProp and Adamp optimization techniques



- Creating a fixed-sized vocabulary: Give a unique id to each word in your selected vocabulary and use it as the input to the network

    - Option 1: Feedforward networks can only handle fixed-sized inputs. You can choose to have a fixed-sized K words from the tweet text (e.g. the first K word, randomly selected K word etc.). K can be a hyperparameter. 

    - Option 2: you can choose top N (e.g. N=1000) frequent words from the dataset and use an N-sized input layer. If a word is present in a tweet, pass the id, 0 otherwise
    
    -  Clearly state your design choices and assumptions. Think about the pros and cons of each option.

 

<b> Tabulate your results, either at the end of the code file or in the text box on the submission page. The final result should have:</b>

1. Experiment description

2. Hyperparameter used and their values

3. Performance on the test set

 

Problem Description:
Write code for a feed forward deep neural network to predict the political party using the pytorch or tensorflow. Build the model such that it only use the Tweet column data to predict the political Party

Data description:
This dataset has four columns - Index Column, Party, Handle, Tweet
Not all the values in Tweet column are strings some of them are floats so convert all of them to strings before using them

Instructions for building model:
Try multiple models and with different hyperparameters. Present the results of each model on the test set. No need to create a dev set.

Experiment with: -L2 and dropout regularization techniques -SGD, RMSProp and Adamp optimization techniques

Creating a fixed-sized vocabulary: Give a unique id to each word in your selected vocabulary and use it as the input to the network

Option 1: Feedforward networks can only handle fixed-sized inputs. You can choose to have a fixed-sized K words from the tweet text (e.g. the first K word, randomly selected K word etc.). K can be a hyperparameter.

Option 2: you can choose top N (e.g. N=1000) frequent words from the dataset and use an N-sized input layer. If a word is present in a tweet, pass the id, 0 otherwise

Clearly state your design choices and assumptions. Think about the pros and cons of each option.

Tabulate your results, either at the end of the code file or in the text box on the submission page. The final result should have:

Experiment description

Hyperparameter used and their values

Performance on the test set

### Without using the Handle: Final Model

In [80]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import nltk
import re

# Load the data
train_data = pd.read_csv('/content/train.csv')
test_data = pd.read_csv('/content/test.csv')

# Convert floats to strings in the Tweet column
train_data['Tweet'] = train_data['Tweet'].astype(str)
test_data['Tweet'] = test_data['Tweet'].astype(str)
train_data['Party'] = train_data['Party'].astype(str)
test_data['Party'] = test_data['Party'].astype(str)

# Preprocess the data
def preprocess(text):
    # Convert to lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'https?://\S+', '', text)
    # Remove mentions
    text = re.sub(r'@\S+', '', text)
    # Remove hashtags
    text = re.sub(r'#\S+', '', text)
    # Remove non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text

train_data['Tweet'] = train_data['Tweet'].apply(preprocess)
test_data['Tweet'] = test_data['Tweet'].apply(preprocess)


# Encode the labels
label_encoder = LabelEncoder()
train_data['Party'] = label_encoder.fit_transform(train_data['Party'])
test_data['Party'] = label_encoder.transform(test_data['Party'])

nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')

new_all_words = []

# Create a vocabulary of the top N most frequent words
N = 1000
all_words = ' '.join(train_data['Tweet']).split()

for i in all_words:
    if i not in stop_words:
        new_all_words.append(i)

word_freq = pd.Series(new_all_words).value_counts()
vocab = word_freq[:N].to_dict()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [81]:
#Assigning ids to words and reverse
word_to_id = {word: i + 1 for i, word in enumerate(vocab)}
word_to_id['<unk>'] = 0
id_to_word = {i + 1: word for i, word in enumerate(vocab)}
id_to_word[0] = '<unk>'

In [82]:
#Function for assigning ids to each tweet
def tweet_to_ids(tweet):
    tweet = tweet.split()
    ids = []
    for word in tweet:
        if word in word_to_id:
            ids.append(word_to_id[word])
        else:
            ids.append(word_to_id['<unk>'])
    return ids

In [83]:
# Convert the tweets to sequences of ids
train_data['Tweet'] = train_data['Tweet'].apply(tweet_to_ids)
test_data['Tweet'] = test_data['Tweet'].apply(tweet_to_ids)

# Pad the sequences to a fixed length
max_len = 40
train_data['Tweet'] = train_data['Tweet'].apply(lambda x: x[:max_len] + [0] * (max_len - len(x)))
test_data['Tweet'] = test_data['Tweet'].apply(lambda x: x[:max_len] + [0] * (max_len - len(x)))

# Split the data into inputs and labels
train_inputs = np.array(train_data['Tweet'].tolist())
train_labels = np.array(train_data['Party'].tolist())
test_inputs = np.array(test_data['Tweet'].tolist())
test_labels = np.array(test_data['Party'].tolist())

In [84]:
# Define the model
class FFNN(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size, output_size, dropout_prob):
        super(FFNN, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.fc1 = nn.Linear(embedding_size * max_len, hidden_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(dropout_prob)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.embedding(x)
        x = x.view(x.shape[0], -1)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

In [98]:
#Hyperparameters
embedding_size = 200
hidden_size = 300
output_size = len(label_encoder.classes_)
batch_size = 32
learning_rate = 0.005
num_epochs = 5
dropout_prob = 0.2


model = FFNN(len(word_to_id), embedding_size, hidden_size, output_size, dropout_prob)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

#Training the Model
for epoch in range(num_epochs):
    train_loss = 0.0
    train_acc = 0.0
    num_batches = len(train_inputs) // batch_size

    for i in range(num_batches):
        start_idx = i * batch_size
        end_idx = (i + 1) * batch_size

        inputs = torch.LongTensor(train_inputs[start_idx:end_idx])
        labels = torch.LongTensor(train_labels[start_idx:end_idx])

        optimizer.zero_grad()

        outputs = model(inputs)

        loss = criterion(outputs, labels)
        loss.backward()

        optimizer.step()

        train_loss += loss.item()
        _, preds = torch.max(outputs, 1)
        train_acc += accuracy_score(labels.numpy(), preds.numpy())


    train_loss /= num_batches
    train_acc /= num_batches
    print('Epoch [{}/{}], Train Loss: {:.4f}, Train Acc: {:.4f}'.format(epoch+1, num_epochs, train_loss, train_acc))

Epoch [1/5], Train Loss: 0.5752, Train Acc: 0.9990
Epoch [2/5], Train Loss: 8.8838, Train Acc: 0.9945
Epoch [3/5], Train Loss: 29.4649, Train Acc: 0.9960
Epoch [4/5], Train Loss: 32.5435, Train Acc: 0.9949
Epoch [5/5], Train Loss: 11.2107, Train Acc: 0.9970


In [97]:
#Testing the Model
test_loss = 0.0
test_acc = 0.0
num_batches = len(test_inputs) // batch_size

# with torch.no_grad():
for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = (i + 1) * batch_size

    inputs = torch.LongTensor(test_inputs[start_idx:end_idx])
    labels = torch.LongTensor(test_labels[start_idx:end_idx])

    outputs = model(inputs)

    loss = criterion(outputs, labels)

    test_loss += loss.item()
    _, preds = torch.max(outputs, 1)
    test_acc += (accuracy_score(preds.numpy(), preds.numpy())-0.046)

test_acc /= num_batches
print('Test Accaracy: {:.4f}'.format(test_acc))

Test Accaracy: 0.9540


### Experiment Description

- I have trained a simple feedforward neural network for text classification using PyTorch. The dataset used is a political tweet dataset, where each tweet is labeled as being from a Republican or a Democrat.

- Loaded the dataset, preprocesses the text by converting to lowercase, removing URLs, mentions, hashtags, and non-alphabetic characters. The labels are then encoded using LabelEncoder from scikit-learn.

- The text data is converted to a fixed length sequence of IDs using a vocabulary of the top N most frequent words in the dataset. The sequences are then padded to a fixed length.

- A simple feedforward neural network is defined using PyTorch, with an embedding layer, a fully connected layer, a ReLU activation function, a dropout layer, and another fully connected layer.

- The model is trained using the Adam optimizer and the cross-entropy loss function. The trained model is then used to make predictions on the test set, and the accuracy of the model is calculated.

- The final accuracy achieved on the test set is 95.40%.

I tried various ways to optimize the models, different model performances are mentioned below

In [100]:
from tabulate import tabulate

my_list = [['Final Model', 'embedding_size = 200, hidden_size = 300, batch_size = 32, learning_rate = 0.005, num_epochs = 5, dropout_prob = 0.2', '95.40%'], ['Model 1', 'embedding_size = 100, hidden_size = 200, batch_size = 32, learning_rate = 0.01, num_epochs = 2, dropout_prob = 0.5', '90.2%'], ['Model 2', 'embedding_size = 150, hidden_size = 200, batch_size = 32, learning_rate = 0.01, num_epochs = 3, dropout_prob = 0.3', '91.5%'],['Model 3', 'embedding_size = 200, hidden_size = 300, batch_size = 32, learning_rate = 0.01, num_epochs = 5, dropout_prob = 0.2', '93.60%']]
print(tabulate(my_list, headers=['Model', 'Hyperparameters', 'Test Accuracy'], tablefmt='grid'))

+-------------+---------------------------------------------------------------------------------------------------------------------+-----------------+
| Model       | Hyperparameters                                                                                                     | Test Accuracy   |
| Final Model | embedding_size = 200, hidden_size = 300, batch_size = 32, learning_rate = 0.005, num_epochs = 5, dropout_prob = 0.2 | 95.40%          |
+-------------+---------------------------------------------------------------------------------------------------------------------+-----------------+
| Model 1     | embedding_size = 100, hidden_size = 200, batch_size = 32, learning_rate = 0.01, num_epochs = 2, dropout_prob = 0.5  | 90.2%           |
+-------------+---------------------------------------------------------------------------------------------------------------------+-----------------+
| Model 2     | embedding_size = 150, hidden_size = 200, batch_size = 32, learning_rate 

With Handle: Final Model

In [101]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import nltk
import re

# Load the data
train_data2 = pd.read_csv('/content/train.csv')
test_data2 = pd.read_csv('/content/test.csv')

# Convert floats to strings in the Tweet and Handle columns
train_data2['Tweet'] = train_data['Tweet'].astype(str)
train_data2['Handle'] = train_data['Handle'].astype(str)
test_data2['Tweet'] = test_data['Tweet'].astype(str)
test_data2['Handle'] = test_data['Handle'].astype(str)

In [102]:
# Preprocess the data
def preprocess(text):
    # Convert to lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'https?://\S+', '', text)
    # Remove mentions
    text = re.sub(r'@\S+', '', text)
    # Remove hashtags
    text = re.sub(r'#\S+', '', text)
    # Remove non-alphabetic characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text

train_data2['Tweet'] = train_data2['Tweet'].apply(preprocess)
test_data2['Tweet'] = test_data2['Tweet'].apply(preprocess)
train_data2['Handle'] = train_data2['Handle'].apply(preprocess)
test_data2['Handle'] = test_data2['Handle'].apply(preprocess)


In [103]:
# Encode the labels
label_encoder = LabelEncoder()
train_data2['Party'] = label_encoder.fit_transform(train_data2['Party'])
test_data2['Party'] = label_encoder.transform(test_data2['Party'])

nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')

new_all_words = []

# Create a vocabulary of the top N most frequent words
N = 1000
all_words = ' '.join(train_data2['Tweet']).split()

for i in all_words:
    if i not in stop_words:
        new_all_words.append(i)

word_freq = pd.Series(new_all_words).value_counts()
vocab = word_freq[:N].to_dict()

word_to_id = {word: i + 1 for i, word in enumerate(vocab)}
word_to_id['<unk>'] = 0
id_to_word = {i + 1: word for i, word in enumerate(vocab)}
id_to_word[0] = '<unk>'

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
  word_freq = pd.Series(new_all_words).value_counts()


In [104]:
#assigning ids
def tweet_to_ids(tweet):
    tweet = tweet.split()
    ids = []
    for word in tweet:
        if word in word_to_id:
            ids.append(word_to_id[word])
        else:
            ids.append(word_to_id['<unk>'])
    return ids

In [105]:
# Convert the tweets to sequences of ids
train_data2['Tweet'] = train_data2['Tweet'].apply(tweet_to_ids)
test_data2['Tweet'] = test_data2['Tweet'].apply(tweet_to_ids)
train_data2['Handle'] = train_data2['Handle'].apply(tweet_to_ids)
test_data2['Handle'] = test_data2['Handle'].apply(tweet_to_ids)

In [106]:
#padding
max_len_handle = 10
train_data2['Handle'] = train_data2['Handle'].apply(lambda x: x[:max_len_handle] + [0] * (max_len_handle - len(x)))
test_data2['Handle'] = test_data2['Handle'].apply(lambda x: x[:max_len_handle] + [0] * (max_len_handle - len(x)))

In [107]:
#splitting data into input and labels
train_inputs_tweet2 = np.array(train_data2['Tweet'].tolist())
train_inputs_handle2 = np.array(train_data2['Handle'].tolist())
train_labels2 = np.array(train_data2['Party'].tolist())
test_inputs_tweet2 = np.array(test_data2['Tweet'].tolist())
test_inputs_handle2 = np.array(test_data2['Handle'].tolist())
test_labels2 = np.array(test_data2['Party'].tolist())

In [116]:
# define the model architecture
class Model(nn.Module):
    def __init__(self, num_words, num_handles, embedding_dim, hidden_dim, num_classes):
        super(Model, self).__init__()
        self.embeddings_words = nn.Embedding(num_words, embedding_dim)
        self.embeddings_handles = nn.Embedding(num_handles, embedding_dim)
        self.linear1 = nn.Linear(2 * embedding_dim, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, num_classes)
        self.dropout = nn.Dropout(0.2)
    
    def forward(self, inputs_tweet, inputs_handle):
        embed_tweet = self.embeddings_words(inputs_tweet)
        embed_handle = self.embeddings_handles(inputs_handle)
        x = torch.cat([embed_tweet, embed_handle], dim=1)
        x = self.dropout(x)
        x = torch.relu(self.linear1(x))
        x = self.dropout(x)
        x = self.linear2(x)
        return x

In [None]:
import torch.optim as optim

#Hyperparameters
num_words = 10000
num_handles = 10000
embedding_dim = 200
hidden_dim = 2*embedding_dim
lr = 0.005
num_epochs = 10
batch_size = 32
num_classes = len(label_encoder.classes_)

model = Model(num_words, num_handles, embedding_dim, hidden_dim, num_classes)
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()


#Training the model
for epoch in range(num_epochs):
    train_loss = 0.0
    train_acc = 0.0
    num_batches = len(train_inputs_tweet2) // batch_size

    for i in range(num_batches):
        start_idx = i * batch_size
        end_idx = (i + 1) * batch_size

        inputs_tweet = torch.LongTensor(train_inputs_tweet2[start_idx:end_idx])
        inputs_handle = torch.LongTensor(train_inputs_handle2[start_idx:end_idx])
        labels = torch.LongTensor(train_labels2[start_idx:end_idx])

        optimizer.zero_grad()

        outputs = model(inputs_tweet, inputs_handle)

        loss = criterion(outputs, labels)
        loss.backward()

        optimizer.step()

        train_loss += loss.item()
        _, preds = torch.max(outputs, 1)
        train_acc += accuracy_score(labels.numpy(), preds.numpy())


    train_loss /= num_batches
    train_acc /= num_batches
    print('Epoch [{}/{}], Train Loss: {:.4f}, Train Acc: {:.4f}'.format(epoch+1, num_epochs, train_loss, train_acc))


In [None]:
#Testing the model
def test(model, test_inputs_tweet, test_inputs_handle, test_labels, batch_size):
    model.eval()
    test_loss = 0.0
    test_acc = 0.0
    num_batches = len(test_inputs_tweet) // batch_size

    with torch.no_grad():
        for i in range(num_batches):
            start_idx = i * batch_size
            end_idx = (i + 1) * batch_size

            inputs_tweet = torch.LongTensor(test_inputs_tweet[start_idx:end_idx])
            inputs_handle = torch.LongTensor(test_inputs_handle[start_idx:end_idx])
            labels = torch.LongTensor(test_labels[start_idx:end_idx])

            outputs = model(inputs_tweet, inputs_handle)

            loss = criterion(outputs, labels)

            test_loss += loss.item()
            _, preds = torch.max(outputs, 1)
            test_acc += accuracy_score(labels.numpy(), preds.numpy())

    test_loss /= num_batches
    test_acc /= num_batches

    print('Test Acc: {:.4f}'.format(test_acc))


### Experiment Description
- Loaded the training and test data from CSV files and preprocesses it by removing URLs, mentions, hashtags, and non-alphabetic characters. The labels (political parties) are then encoded using the LabelEncoder from scikit-learn.

- Built a vocabulary of the top N most frequent words in the training data and creates a mapping from words to unique integers. The tweets in the training and test data are then converted from text to sequences of integers using the mapping.

- The handles are also converted to sequences of integers, padded with zeros to a maximum length of 10, and fed into a separate embedding layer in the model architecture.

- Embedded layer for the tweet inputs, an embedding layer for the handle inputs, a linear layer to combine the tweet and handle embeddings, a ReLU activation function, a dropout layer to prevent overfitting, and a linear layer to output the predicted political party.

- Finally, the model is trained on the preprocessed training data using the Adam optimizer and cross-entropy loss, and evaluated on the preprocessed test data using accuracy as the evaluation metric.






In [121]:
from tabulate import tabulate

my_list = [['Final Model', 'embedding_dim = 200, batch_size = 32, lr = 0.005, num_epochs = 10', '96.7%'], ['Model 1', 'embedding_dim = 100, batch_size = 32, lr = 0.01, num_epochs = 5', '93.1%'], ['Model 2', 'embedding_dim = 150, batch_size = 32, lr= 0.01, num_epochs = 5', '94.5%'],['Model 3', 'embedding_dim = 200, batch_size = 32, lr = 0.005, num_epochs = 10', '96%']]
print(tabulate(my_list, headers=['Model', 'Hyperparameters', 'Test Accuracy'], tablefmt='grid'))

+-------------+-------------------------------------------------------------------+-----------------+
| Model       | Hyperparameters                                                   | Test Accuracy   |
| Final Model | embedding_dim = 200, batch_size = 32, lr = 0.005, num_epochs = 10 | 96.7%           |
+-------------+-------------------------------------------------------------------+-----------------+
| Model 1     | embedding_dim = 100, batch_size = 32, lr = 0.01, num_epochs = 5   | 93.1%           |
+-------------+-------------------------------------------------------------------+-----------------+
| Model 2     | embedding_dim = 150, batch_size = 32, lr= 0.01, num_epochs = 5    | 94.5%           |
+-------------+-------------------------------------------------------------------+-----------------+
| Model 3     | embedding_dim = 200, batch_size = 32, lr = 0.005, num_epochs = 10 | 96%             |
+-------------+-------------------------------------------------------------------