5.3.1 Text Classification
• Design and implement a system that recommends a conference to a researcher given the title of their new
article.
• The system should use the provided Conference Proceedings training data. You should implement
the sub-tasks (feature extraction, dimensionality reduction, and classifier) by yourself.
• You are free to select the algorithms you prefer for each sub-task. However, it is recommended that
you test and compare multiple methods.
• Evaluate your system on the training set using cross-validation. Provide the confusion
matrix of your system's output.
• Report the evaluation in terms of micro-averaged precision, recall, and F1.
• Once you have found the best model on the training set, evaluate it on the test set and report the
results.

In [11]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

In [28]:
train_data = pd.read_csv('DBLPTrainset.txt', sep='\t', header=None, names=['ID', 'Conference', 'Title'])
test_data = pd.read_csv('DBLPTestset.txt', sep='\t', header=None, names=['ID', 'Title'])
ground_truth_labels = pd.read_csv('DBLPTestGroundTruth.txt', sep='\t', header=None, names=['ID', 'Conference'])

In [29]:
# TF-IDF Vectorization for both training and test sets
tfidf_Vectorizer = TfidfVectorizer(stop_words='english')
X_train = tfidf_Vectorizer.fit_transform(train_data['Title'])
y_train = train_data['Conference']

X_test = tfidf_Vectorizer.transform(test_data['Title'])
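
The task lists dimensionality reduction as a sub-task, which the cells here do not perform. A minimal sketch of one possible option, assuming truncated SVD (latent semantic analysis) applied to the sparse TF-IDF matrix; the toy titles, the `toy_` variable names, and `n_components=2` are illustrative assumptions, not values from the original notebook:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy titles standing in for train_data['Title'].
toy_titles = [
    "Scalable query processing for relational databases",
    "Indexing structures for large databases",
    "Deep neural networks for image recognition",
    "Convolutional networks for object detection",
]

toy_tfidf = TfidfVectorizer(stop_words='english').fit_transform(toy_titles)

# n_components=2 only suits this toy corpus; on real data it would need tuning.
svd = TruncatedSVD(n_components=2, random_state=0)
toy_reduced = svd.fit_transform(toy_tfidf)  # dense array, one row per title

print(toy_tfidf.shape, '->', toy_reduced.shape)
```

On the real data the same fitted `svd` object would be applied to the test matrix with `transform`, so that both sets share one projection.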

In [30]:
# Train a Support Vector Machine (SVM) classifier
svm_classifier = SVC(kernel='linear', C=1.0)
svm_classifier.fit(X_train, y_train)
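
The task asks for cross-validation on the training set together with a confusion matrix, which the cells here do not show. A hedged sketch of how that step might look, using the same TF-IDF and linear SVM settings; the toy titles, the `CONF_A`/`CONF_B`/`CONF_C` labels, the `cv_` names, and the 2-fold split are assumptions for illustration only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for train_data['Title'] and train_data['Conference'].
cv_titles = [
    "Scalable query processing for relational databases",
    "Indexing structures for large databases",
    "Transaction management in distributed databases",
    "Query optimization for database systems",
    "Deep neural networks for image recognition",
    "Convolutional networks for object detection",
    "Image segmentation with neural networks",
    "Object recognition in natural images",
    "Routing protocols for wireless networks",
    "Congestion control in wireless sensor networks",
    "Packet scheduling for wireless routing",
    "Energy efficient routing protocols",
]
cv_labels = ["CONF_A"] * 4 + ["CONF_B"] * 4 + ["CONF_C"] * 4

# The vectorizer lives inside the pipeline so it is refit on each training fold
# and never sees the held-out titles.
cv_model = make_pipeline(TfidfVectorizer(stop_words='english'),
                         SVC(kernel='linear', C=1.0))
cv_folds = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
cv_pred = cross_val_predict(cv_model, cv_titles, cv_labels, cv=cv_folds)

cv_classes = sorted(set(cv_labels))
cv_matrix = confusion_matrix(cv_labels, cv_pred, labels=cv_classes)  # rows: true, columns: predicted
cv_p, cv_r, cv_f1, _ = precision_recall_fscore_support(cv_labels, cv_pred, average='micro')

print(cv_classes)
print(cv_matrix)
print(f"micro P={cv_p:.3f} R={cv_r:.3f} F1={cv_f1:.3f}")
```

With the real training set, `cv_titles` and `cv_labels` would be replaced by the `Title` and `Conference` columns, and a larger number of folds would normally be used.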

In [31]:
y_pred_test = svm_classifier.predict(X_test)

In [32]:
# Combine test data with predicted labels
test_data['Predicted_Conference_Label'] = y_pred_test


In [33]:
# Combine test data with predicted labels and ground truth labels
evaluation_data = pd.DataFrame({'ID': test_data['ID'], 'Predicted_Conference_Label': y_pred_test})
evaluation_data = pd.merge(evaluation_data, ground_truth_labels, on='ID')

In [36]:
# Calculate accuracy
accuracy = (evaluation_data['Predicted_Conference_Label'] == evaluation_data['Conference']).mean()
print(f"Accuracy on the test set: {accuracy}")

Accuracy on the test set: 0.8639193596205159
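
The cell above reports accuracy, while the task asks for micro-averaged precision, recall, and F1. When each title receives exactly one predicted label, every misclassification counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so the three micro-averaged scores coincide with accuracy. A small pure-Python sketch of that computation; the function name `micro_prf` and the demo labels are hypothetical:

```python
from collections import Counter

def micro_prf(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # the predicted class gains a false positive
            fn[gold] += 1  # the true class gains a false negative
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())
    precision = total_tp / (total_tp + total_fp) if total_tp + total_fp else 0.0
    recall = total_tp / (total_tp + total_fn) if total_tp + total_fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels; in the notebook these would be the 'Conference' and
# 'Predicted_Conference_Label' columns of evaluation_data.
demo_true = ['A', 'A', 'B', 'C', 'C']
demo_pred = ['A', 'B', 'B', 'C', 'A']

demo_p, demo_r, demo_f1 = micro_prf(demo_true, demo_pred)
demo_accuracy = sum(g == p for g, p in zip(demo_true, demo_pred)) / len(demo_true)
print(demo_p, demo_r, demo_f1, demo_accuracy)
```

Because the micro-averaged scores carry no more information than accuracy in this setting, the confusion matrix is the more informative place to see which conferences get confused with each other.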


5.3.2 Named Entity Recognition
• Design and implement a system to extract and classify named entities in tweets.
• The system should use the provided NER Twitter training data. You should implement the sub-tasks
(feature extraction and classifier) by yourself.
• You are free to select the algorithms you prefer for each sub-task. However, it is recommended that you test and compare multiple methods.
• Evaluate your system on the training set using cross-validation.
• Report the evaluation in terms of micro-averaged precision, recall, and F1.
• Evaluate the best model found on the training set on the test set and report your results.
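
For NER, micro-averaged precision and recall generally differ, because the number of predicted entities need not equal the number of gold entities. A minimal sketch of entity-level scoring, assuming BIO-style tags and exact-match spans; neither assumption is stated in the task, and the function names, entity types, and example tags are hypothetical:

```python
def bio_spans(tags):
    """Return the set of (start, end, type) entity spans in one BIO tag sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel 'O' closes a trailing entity
        continues = tag.startswith('I-') and etype == tag[2:]
        if not continues:
            if etype is not None:
                spans.add((start, i, etype))  # end index is exclusive
            if tag.startswith('B-') or tag.startswith('I-'):
                start, etype = i, tag[2:]  # a stray I- tag is treated as an entity start
            else:
                start, etype = None, None
    return spans

def micro_entity_prf(gold_sequences, pred_sequences):
    """Micro-average over all entities in all sequences (exact span and type match)."""
    tp = n_pred = n_gold = 0
    for gold_tags, pred_tags in zip(gold_sequences, pred_sequences):
        gold, pred = bio_spans(gold_tags), bio_spans(pred_tags)
        tp += len(gold & pred)
        n_pred += len(pred)
        n_gold += len(gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical tags for two tweets: the prediction misses one of three gold entities.
ner_gold = [['B-PER', 'I-PER', 'O', 'B-LOC'], ['O', 'B-ORG', 'O']]
ner_pred = [['B-PER', 'I-PER', 'O', 'O'], ['O', 'B-ORG', 'O']]

ner_p, ner_r, ner_f1 = micro_entity_prf(ner_gold, ner_pred)
print(ner_p, ner_r, ner_f1)
```

If the provided data uses a different tagging scheme, or if token-level scoring is intended instead, `bio_spans` would need to be adapted accordingly.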