# Text Classification Problem

We expect the candidate to develop a solution capable of classifying the provided texts into one of **four** classes.
 
You may find the dataset in the **data** folder:
- train.csv contains the training dataset. There are four columns in this file:
    - id - column with unique identifier of each data sample
    - category - target variable
    - title - document title
    - description - document text
- test.csv contains the test dataset. Its columns are the same as in train.csv, except that category is missing because it is unknown and must be predicted.
- sample_submission.csv - an example of what the resulting submission should look like.

Your model should output the probability of each sample belonging to each class.

To submit your solution, put this **solution.ipynb** file and the generated **submission.csv** in a **zip** file.

We are interested in seeing how the candidate implements their typical pipeline for solving machine learning problems, starting from a dataset that contains both the data and the target variable.

We **do not** expect a state-of-the-art solution here, but rather code that demonstrates the candidate's understanding of the crucial parts of ML model development. However, a brief description in the conclusions of how to reach a near-state-of-the-art solution would be a plus.

#### Imports

In [1]:
import numpy as np
import pandas as pd


# add needed libraries here
from utils import ReplaceDiatrics, PunctRemove, Tfidf_fit, Tfidf_transform, PreprocessText
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import pickle
from sklearn import svm
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, f1_score
from sklearn.metrics import confusion_matrix

#### Your solution

In [2]:
# put your code in this and the following blocks

In [3]:
df=pd.read_csv("data/train.csv")

#check dataset balance
df['category'].value_counts()

3    30000
2    30000
1    30000
0    30000
Name: category, dtype: int64

In [4]:

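# work with title and description as a single text field
# (no separator is added, so the last title word is fused with the first description word)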
df["text_0"]=df["title"]+df["description"]



processed_df=PreprocessText(df)
processed_df.head()

X,y=processed_df.text_0,processed_df.category

-----Text Preprocessing Started-----
 ----Removing Stopwords---- 
 ----Removing Diatrics---- 
 ----Removing Punctuation---- 
-----Text Preprocessing Finished-----
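
The `utils` module is not shown in this notebook, so the exact behaviour of `PreprocessText` is unknown. As a rough, hypothetical sketch of the three logged steps (stopword removal, diacritic removal, punctuation removal), something like the following could work. The function names and the tiny stopword set are assumptions for illustration; the notebook itself imports NLTK's stopword list.

```python
import string
import unicodedata

# Tiny illustrative stopword set (assumption); the notebook imports NLTK's list instead.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def remove_stopwords(text: str) -> str:
    """Drop whitespace-separated tokens whose lowercase form is a stopword."""
    return " ".join(tok for tok in text.split() if tok.lower() not in STOPWORDS)


def strip_diacritics(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def remove_punctuation(text: str) -> str:
    """Replace ASCII punctuation with spaces and collapse repeated whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(text.translate(table).split())


def preprocess(text: str) -> str:
    """Apply the three steps in the order the notebook logs them."""
    return remove_punctuation(strip_diacritics(remove_stopwords(text)))


print(preprocess("The café is on the corner, in Zürich!"))  # cafe corner Zurich
```

One consequence of this ordering is that a stopword glued to punctuation (for example `the,`) survives the first step, because it no longer matches the stopword set when it is checked.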


In [5]:
#train vectorizer and transform text
Tfidf_fit(X)
X=Tfidf_transform(X)


-----Tfidf fitting-----
-----Tfidf transforming-----


In [6]:
#split into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
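
In the cells above, the TF-IDF vectorizer is fitted on all of `X` before the split, so the held-out rows also contribute to the vocabulary and IDF weights. A possible alternative, sketched here on a made-up toy corpus, is to put scikit-learn's `TfidfVectorizer` inside the `Pipeline` so that it is fitted on the training split only. This is an illustrative variant under those assumptions, not the notebook's `Tfidf_fit`/`Tfidf_transform` helpers, and the linear kernel is chosen only to keep the toy example simple.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy corpus standing in for the preprocessed text_0 column (illustrative only).
texts = [
    "stocks rally as markets open", "central bank raises interest rates",
    "shares fall after earnings report", "investors watch bond markets",
    "team wins championship final", "striker scores twice in derby",
    "coach praises young players", "club signs new goalkeeper",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The vectorizer is a pipeline step, so fit() only ever sees the training split.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SVC(kernel="linear", C=10.0, random_state=1)),
])
model.fit(X_train, y_train)

val_pred = model.predict(X_val)
print("F1 Score -> ", f1_score(y_val, val_pred, average="macro"))
```

A side benefit of this layout is that pickling `model` stores the vectorizer and the classifier together, so the test-time code only needs to load one object.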

In [7]:
#select best hyperparameters



SVM = Pipeline([('clf', SVC(random_state=1,C=10.0,probability=True))])
'''
param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

param_grid = [{'clf__C': param_range,
               'clf__kernel': ['linear']},
              {'clf__C': param_range,
               'clf__gamma': param_range,
               'clf__kernel': ['rbf']}]

gs = GridSearchCV(estimator=SVM,
                  param_grid=param_grid,
                  scoring='f1_macro',
                  cv=10,
                  n_jobs=1)
gs = gs.fit(X_train, y_train)

print('--> Best score: ',gs.best_score_)
print('--> Best parameters: \n',gs.best_params_)


#Select best parameters

SVM = gs.best_estimator_
'''
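#fit the final model with the fixed values set above (C=10.0; no kernel is passed, so SVC uses its default 'rbf')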

SVM.fit(X_train, y_train)


# predict labels on validation set
predictions_SVM = SVM.predict(X_test)

# Use f1 score function 
print("F1 Score -> ",f1_score(y_test, predictions_SVM, average='macro'))


F1 Score ->  0.9164324581721721


In [8]:
#store model
with open('svm.sav', 'wb') as f:
    pickle.dump(SVM, f)

#### Model testing

In [9]:
#read test data
test_df=pd.read_csv("data/test.csv")
test_answers_df=pd.read_csv("data/test_answers.csv")

test_df=pd.merge(test_df, test_answers_df, on="id") 

test_df["text_0"]=test_df["title"]+test_df["description"]



processed_test_df=PreprocessText(test_df)
X_t=processed_test_df.text_0
X_t=Tfidf_transform(X_t)


with open('svm.sav', 'rb') as f:
    svm_model = pickle.load(f)


predicted_cat=svm_model.predict(X_t)
prob_classes=svm_model.predict_proba(X_t)
test_df["pred_category"]=predicted_cat


print("F1 Score -> ",f1_score(test_df["category"], test_df["pred_category"], average='macro'))
confusion_matrix(test_df["category"],test_df["pred_category"])

-----Text Preprocessing Started-----
 ----Removing Stopwords---- 
 ----Removing Diatrics---- 
 ----Removing Punctuation---- 
-----Text Preprocessing Finished-----
-----Tfidf transforming-----
F1 Score ->  0.9085468938415864


array([[1725,   48,   69,   58],
       [  19, 1849,   17,   15],
       [  68,   14, 1652,  166],
       [  70,   19,  131, 1680]], dtype=int64)
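
The confusion matrix above can also be read per class. The following sketch copies the printed matrix and derives precision, recall and F1 for each class with NumPy, assuming scikit-learn's convention that rows are true labels and columns are predictions. The macro average of the per-class F1 values agrees with the score printed above (about 0.9085).

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = np.array([
    [1725,   48,   69,   58],
    [  19, 1849,   17,   15],
    [  68,   14, 1652,  166],
    [  70,   19,  131, 1680],
])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)     # correct / all samples of the class
precision = true_positives / cm.sum(axis=0)  # correct / all predictions of the class
f1 = 2 * precision * recall / (precision + recall)
macro_f1 = f1.mean()

for cls in range(len(cm)):
    print(f"class {cls}: precision={precision[cls]:.3f} "
          f"recall={recall[cls]:.3f} f1={f1[cls]:.3f}")
print(f"macro F1 = {macro_f1:.4f}")
```

The largest off-diagonal counts are 166 (class 2 predicted as class 3) and 131 (class 3 predicted as class 2), so most of the remaining error comes from confusing these two classes.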

In [10]:
predictions=prob_classes

#### Prepare submission

In [11]:
# edit the following code to generate a submission file
submission = pd.DataFrame()
submission['id'] = test_df['id']
submission['category_0'] = predictions[:, 0]
submission['category_1'] = predictions[:, 1]
submission['category_2'] = predictions[:, 2]
submission['category_3'] = predictions[:, 3]
submission.to_csv('submission.csv', index=False)
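
Before zipping the file, it may be worth sanity-checking the submission. The helper below is a hypothetical sketch (`check_submission` is not part of the notebook) that assumes the layout used in the cell above: an `id` column followed by `category_0` to `category_3`, with each row of probabilities summing to 1.

```python
import numpy as np
import pandas as pd


def check_submission(submission: pd.DataFrame, n_classes: int = 4) -> None:
    """Raise AssertionError if the frame does not look like a valid submission."""
    prob_cols = [f"category_{k}" for k in range(n_classes)]
    assert list(submission.columns) == ["id"] + prob_cols, "unexpected columns"
    assert submission["id"].is_unique, "duplicate ids"
    probs = submission[prob_cols].to_numpy()
    assert np.all((probs >= 0) & (probs <= 1)), "probabilities outside [0, 1]"
    assert np.allclose(probs.sum(axis=1), 1.0), "rows do not sum to 1"


# Toy frame in the same layout as the notebook's submission (values are made up).
demo = pd.DataFrame({
    "id": [0, 1],
    "category_0": [0.7, 0.1],
    "category_1": [0.1, 0.6],
    "category_2": [0.1, 0.2],
    "category_3": [0.1, 0.1],
})
check_submission(demo)
print("submission looks valid")
```

In the notebook this would be called as `check_submission(submission)` right before `to_csv`.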

# Conclusions

- Write a few words about your solution here. 

I developed a text classification pipeline that fits a TF-IDF vectorizer on the text and then trains an SVM classifier on the resulting features (scikit-learn's `SVC` with `C=10`; no kernel is passed, so the default RBF kernel applies). Text preprocessing is crucial for good results. I did not apply stemming or lemmatization because the text quality is already good and these steps can sometimes be counterproductive.

- What could be improved? 

Hyperparameter tuning could be done over a wider range of values and on the whole dataset. Instead of grid search, Bayesian optimization could be used.
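
As a first step in that direction, the sketch below swaps the grid for random search with scikit-learn's `RandomizedSearchCV`, sampling `C` and `gamma` on a log scale over the same range as the grid in cell 7. This is random search, not Bayesian optimization, which would need an additional library. The synthetic data from `make_classification` is only a stand-in for the TF-IDF features.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Small synthetic 4-class problem standing in for the TF-IDF features (illustrative only).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=8,
    n_classes=4, random_state=0,
)

pipeline = Pipeline([("clf", SVC(random_state=1))])

# Sample C and gamma on a log scale instead of enumerating a fixed grid.
param_distributions = {
    "clf__C": loguniform(1e-4, 1e3),
    "clf__gamma": loguniform(1e-4, 1e3),
    "clf__kernel": ["rbf"],
}

search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=param_distributions,
    n_iter=10,          # number of sampled configurations
    scoring="f1_macro",
    cv=3,
    random_state=0,
)
search.fit(X, y)

print("--> Best score: ", search.best_score_)
print("--> Best parameters: \n", search.best_params_)
```

The cost is controlled by `n_iter` rather than by the size of the grid, which makes it easier to widen the search range without the number of fits growing multiplicatively.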

- What approaches may work as well for this problem? 

Convolutional neural networks, random forests and Naive Bayes classifiers; as alternative vectorizers, bag of words or word embeddings.

- What would you implement if you had more time for this task?

A CNN

- Feel free to write anything you think is relevant to this task :)

I applied grid search on a subset of the data (to limit time and memory use) to select the parameter `C`.