## Imports

We start by importing the needed modules.

In [None]:
import pandas as pd
import numpy as np
import pickle
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import jaccard_score, f1_score

## Read the Data
**Article column** holds the text that we want to classify.<br>
**The next 10 columns** hold the labels for each article ['فن ومشاهير','أخبار','رياضة','اقتصاد','تكنولوجيا',
 'اسلام و أديان','سيارات','طقس','منوعات أخرى','صحة','مطبخ']. Each article has a binary list of 10 items, one per label: " *0* " means the class is not assigned to the article, and " *1* " means it is.<br>
**topics_number** is the number of labels assigned to each article.
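
To make the encoding concrete, the sketch below decodes one binary row back into label names. It assumes the columns follow the order of the list above; `decode_labels` and `example_row` are hypothetical names introduced only for illustration, and the row itself is made up.

```python
# Label names copied from the list above; the column order is an assumption.
LABELS = ['فن ومشاهير', 'أخبار', 'رياضة', 'اقتصاد', 'تكنولوجيا',
          'اسلام و أديان', 'سيارات', 'طقس', 'منوعات أخرى', 'صحة', 'مطبخ']


def decode_labels(row, labels=LABELS):
    """Return the names of the labels whose indicator is 1 (hypothetical helper)."""
    return [name for name, flag in zip(labels, row) if flag == 1]


# Made-up article tagged with the second and third labels in the list.
example_row = [0] * len(LABELS)
example_row[1] = 1
example_row[2] = 1

assigned = decode_labels(example_row)
topics_number = sum(example_row)  # mirrors what the topics_number column counts
print(assigned, topics_number)
```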

In [None]:
data_folder_path = "../Data/"
train_df = pd.read_csv(data_folder_path+"train.tsv",sep="\t")
validation_df = pd.read_csv(data_folder_path+"validation.tsv",sep="\t")
testing_df = pd.read_csv(data_folder_path+"test_unlabaled.tsv",sep="\t")

## Data Exploration

Below we show the first five rows of each dataset.

In [None]:
train_df.head()

In [None]:
# show first 5 rows of the validation data
validation_df.head()

In [None]:
# show first 5 rows of the testing data
testing_df.head()

## TF-IDF
TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. It is computed by multiplying two metrics: **how many times a word appears in a document**, and the **inverse document frequency of the word across a set of documents**.<br>
Given a word *t*, a document *d* from a set of documents *D*, and the total number of documents *N* in the corpus, we calculate **tf-idf** as follows:

$$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D)$$
where:
$$\mathrm{tf}(t,d) = \log(1+\mathrm{freq}(t,d))$$
$$\mathrm{idf}(t,D) = \log\left(\frac{N}{|\{d \in D : t \in d\}|}\right)$$
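
As a rough illustration of these formulas, the sketch below computes the score by hand on a made-up toy corpus. The helper names (`tf`, `idf`, `tfidf_score`) and the corpus are assumptions for illustration only. Note that scikit-learn's `TfidfVectorizer` with default settings uses raw counts for the term frequency, a smoothed idf, and L2 normalisation, so its values will differ from this direct implementation.

```python
import math


def tf(term, doc):
    """Log-scaled term frequency: log(1 + freq(t, d))."""
    return math.log(1 + doc.count(term))


def idf(term, docs):
    """Inverse document frequency: log(N / number of documents containing t).

    Assumes the term occurs in at least one document.
    """
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)


def tfidf_score(term, doc, docs):
    """Product of the two quantities above."""
    return tf(term, doc) * idf(term, docs)


# Toy corpus of tokenised documents (made up for illustration).
corpus = [
    ["match", "goal", "goal", "team"],
    ["market", "price", "team"],
    ["weather", "rain"],
]

score_goal = tfidf_score("goal", corpus[0], corpus)  # frequent here, absent elsewhere
score_team = tfidf_score("team", corpus[0], corpus)  # appears in two of three documents
print(score_goal, score_team)
```

A word that is frequent in one document and rare in the rest of the corpus ("goal") scores higher than a word shared across documents ("team"), and a word absent from the document scores zero.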


In [None]:
# initialise the TF-IDF vectoriser (word-level, 1- to 3-grams, top 10000 features)
tfidf = TfidfVectorizer(analyzer='word', max_features=10000, ngram_range=(1,3))
# allocate X and y values for the training, validation and testing sets
X_train = train_df.iloc[:,0]
y_train = train_df.iloc[:,1:]
X_validation = validation_df.iloc[:, 0]
y_validation = validation_df.iloc[:,1:]
X_test = testing_df.iloc[:,0]
print("training shapes: Features {}, labels {}".format(X_train.shape, y_train.shape))
print("Validation shapes: Features {}, labels {}".format(X_validation.shape, y_validation.shape))
print("Testing shape: Features {}".format(X_test.shape))

## Train SVM

Below we extract the TF-IDF features for the training and validation datasets, and train our model.

In [None]:
# initialise the model
svc = LinearSVC()
# fit the vectoriser on the training data and extract its TF-IDF feature vectors
X_train = tfidf.fit_transform(X_train)
# extract TF-IDF feature vectors from the validation data (no refitting)
X_validation = tfidf.transform(X_validation)
# wrap the SVM in a one-vs-rest classifier and train it on the training data
clf = OneVsRestClassifier(svc)
clf.fit(X_train, y_train)
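
For intuition about what the one-vs-rest wrapper does with a binary label matrix, here is a minimal sketch on made-up data. The texts, the two label columns, and all `toy_*` names are assumptions for illustration; the point is only that one `LinearSVC` is fitted per label column and the prediction is again a binary row per document.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Made-up documents; columns of toy_labels are [sport, economy].
toy_texts = [
    "football match goal",
    "stock market prices",
    "football club stock sale",
    "market prices rise",
    "goal scored in match",
    "club sale boosts market",
]
toy_labels = np.array([[1, 0], [0, 1], [1, 1], [0, 1], [1, 0], [1, 1]])

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_texts)

# One binary LinearSVC is trained per label column.
toy_clf = OneVsRestClassifier(LinearSVC())
toy_clf.fit(toy_X, toy_labels)
n_estimators = len(toy_clf.estimators_)

# New text must go through transform (not fit_transform) before prediction.
toy_pred = toy_clf.predict(toy_vectorizer.transform(["football goal"]))
print(n_estimators, toy_pred.shape)
```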

## Results on validation and testing sets

Next we evaluate the performance of our model on the validation data, using the Jaccard score and the F1 score with sample, macro and micro averaging.

In [None]:
y_val_pred = clf.predict(X_validation)
print("validation jaccard sample: {}, f1_score sample:{}".
      format(jaccard_score(y_validation, y_val_pred, average="samples"),
             f1_score(y_validation, y_val_pred, average="samples")))
print("validation jaccard macro: {}, f1_score macro:{}".
      format(jaccard_score(y_validation, y_val_pred, average="macro"),
             f1_score(y_validation, y_val_pred, average="macro")))
print("validation jaccard micro: {}, f1_score micro:{}".
      format(jaccard_score(y_validation, y_val_pred, average="micro"),
             f1_score(y_validation, y_val_pred, average="micro")))
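
The three averaging modes can give quite different numbers on the same predictions. The sketch below uses a small made-up label matrix (not the notebook's data) to show the difference: `samples` averages the score over rows, `macro` averages it over label columns, and `micro` pools all individual label decisions before scoring.

```python
import numpy as np
from sklearn.metrics import f1_score, jaccard_score

# Made-up ground truth and predictions: 3 documents, 3 labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 1]])

scores = {}
for average in ("samples", "macro", "micro"):
    scores[average] = (
        jaccard_score(y_true, y_pred, average=average),
        f1_score(y_true, y_pred, average=average),
    )
    print(average, scores[average])
```

Here the third label is never predicted correctly, which pulls the macro average down, while the row-wise `samples` average is less affected because two of the three documents are mostly right.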

## Model Saving

Below we save the needed files to use later in our API.

In [None]:
joblib.dump(tfidf, '../models/tfidf_vectorizer.pkl',compress=6)
joblib.dump(clf,"../models/svc.sav")
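
The saved objects can later be restored with `joblib.load`. The sketch below shows the round trip on a throwaway vectoriser written to a temporary directory; the file name and the `demo_*` names are assumptions for illustration, and the same pattern should apply to the two files saved above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

# Throwaway vectoriser fitted on made-up text, standing in for the real one.
demo_vectorizer = TfidfVectorizer().fit(["first toy document", "second toy document"])

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_vectorizer.pkl")
    joblib.dump(demo_vectorizer, demo_path, compress=6)
    restored = joblib.load(demo_path)

# The restored object carries the fitted vocabulary and can transform new text.
same_vocabulary = restored.vocabulary_ == demo_vectorizer.vocabulary_
demo_features = restored.transform(["toy document"])
print(same_vocabulary, demo_features.shape)
```

The vectoriser and the classifier need to be loaded together, since the classifier expects features in exactly the column order produced by the fitted vectoriser.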

## Submission File Creation

We first extract the TF-IDF feature vectors for the testing data, reusing the vectoriser fitted on the training data.

In [None]:
X_test = tfidf.transform(X_test)

We predict the labels for the test set.

In [None]:
preds = clf.predict(X_test)

Next we save the predictions as a TSV file, without header or index, ready for submission.

In [None]:
df = pd.DataFrame(data=preds, index=None, columns=None)
df.to_csv("../Data/outputs/answer.tsv", header=False, index=False, sep="\t")