# Text Classification Problem

We expect the candidate to develop a solution capable of classifying the provided texts into one of **four** classes.
 
You may find the dataset in the **data** folder:
- train.csv contains the training dataset. This file has four columns:
    - id - column with unique identifier of each data sample
    - category - target variable
    - title - document title
    - description - document text
- test.csv contains the test dataset. Its columns are the same except for category, which is unknown and should be predicted.
- sample_submission.csv - an example of what the resulting submission should look like.

For each sample, your model should output the probability of it belonging to each class.

To submit your solution, put this **solution.ipynb** file and the generated **submission.csv** in a **zip** file.

We are interested in seeing how the candidate implements their typical pipeline for solving machine learning problems, starting from a dataset that contains both the data and the target variable.

We **do not** expect a state-of-the-art solution here, but rather code that demonstrates the candidate's understanding of the crucial parts of ML model development. However, it would be a plus to see, in the conclusions, a brief description of how to get to a near-state-of-the-art solution.

#### Imports

In [1]:
import numpy as np
import pandas as pd
from utils import ReplaceDiatrics, PunctRemove, Tfidf_fit, Tfidf_transform, PreprocessText
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import pickle
from sklearn import svm
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import auc, roc_curve

ModuleNotFoundError: No module named 'utils'

#### Your solution

In [3]:
# put your code in this and the following blocks

In [3]:
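# number of rows held out from each source as a final test set
# (note: the iloc[test_data+1:] slices below also skip the row at position test_data)
test_data = 100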

snoopes=pd.read_csv("data/snoopes.csv")
snoopes_test = snoopes.iloc[:test_data,:]
snoopes = snoopes.iloc[test_data+1:,:]

normal=pd.read_csv("data/normal.csv")
normal_test = normal.iloc[:test_data,:]
normal = normal.iloc[test_data+1:,:]

true_data=pd.read_csv("data/True.csv")
fake_data=pd.read_csv("data/Fake.csv")

del true_data['subject']
del true_data['date']
true_data['label'] = True
true_data_test = true_data.iloc[:test_data,:]
true_data = true_data.iloc[test_data+1:,:]

del fake_data['subject']
del fake_data['date']
fake_data['label'] = False
fake_data_test = fake_data.iloc[:test_data,:]
fake_data = fake_data.iloc[test_data+1:,:]

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([snoopes, normal, true_data, fake_data])
test_df = pd.concat([snoopes_test, normal_test, true_data_test, fake_data_test])

# check dataset balance
df['label'].value_counts()

False    31652
True     26170
Name: label, dtype: int64

In [5]:
# work with title and body as one text field (note: they are joined without a separator)
df["text_0"]=df["title"]+df["text"]

processed_df=PreprocessText(df)
processed_df.head()

X,y=processed_df.text_0,processed_df.label

-----Text Preprocessing Started-----
 ----Removing Stopwords---- 
 ----Removing Diatrics---- 
 ----Removing Punctuation---- 
-----Text Preprocessing Finished-----
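
The `PreprocessText` helper lives in the `utils` module, which is not shown in this notebook. Judging only from the log messages above, it removes stopwords, diacritics, and punctuation. A minimal sketch of such a step, using only the standard library, might look like the following. The names `strip_diacritics`, `preprocess_text`, and the tiny `STOPWORDS` set are hypothetical, and the order of the steps is an assumption (stopwords are removed last here so that attached punctuation does not hide them).

```python
import string
import unicodedata

# Hypothetical, deliberately tiny stopword set; NLTK's English stopword list
# (imported at the top of the notebook) would be the natural replacement.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "and", "is", "was"}


def strip_diacritics(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def preprocess_text(text: str) -> str:
    """Lowercase, strip diacritics, replace punctuation by spaces, drop stopwords."""
    text = strip_diacritics(text.lower())
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.translate(table).split()
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)


cleaned = preprocess_text("The café's owner, José, was in Zürich!")
print(cleaned)
```

On a DataFrame, a function like this could be applied with `df["text_0"].apply(preprocess_text)`.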


In [6]:
#train vectorizer and transform text
Tfidf_fit(X)
X=Tfidf_transform(X)

-----Tfidf fitting-----
-----Tfidf transforming-----


In [7]:
#split into training and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
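
In the cells above, the TF-IDF vectorizer is fitted on all of `X` before the split, so its vocabulary and IDF weights also see the validation documents. One common way to avoid this is to make the vectorizer a step of a scikit-learn `Pipeline` and fit the pipeline on the training split only. The sketch below uses a made-up toy corpus and scikit-learn's own `TfidfVectorizer` rather than the `Tfidf_fit` / `Tfidf_transform` helpers, whose implementation is not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy corpus, made up for illustration only.
texts = [
    "senate passes budget bill after long debate",
    "central bank holds interest rates steady",
    "city council approves new transit plan",
    "court upholds ruling on trade tariffs",
    "celebrity secretly replaced by robot clone",
    "miracle fruit cures every known disease",
    "moon landing filmed in hidden desert studio",
    "government hides giant lizards under airport",
]
labels = [True, True, True, True, False, False, False, False]

X_tr, X_val, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# The vectorizer is a pipeline step, so fit() learns the vocabulary and the
# IDF weights from the training split only.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", SVC(kernel="linear", random_state=0)),
])
model.fit(X_tr, y_tr)

# Validation texts are transformed with the already fitted vectorizer.
val_pred = model.predict(X_val)
print(val_pred)
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the vectorizer is refitted inside every cross-validation fold.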

In [8]:
# select best hyperparameters (the grid search inside the string literal below is disabled)

SVM = SVC(kernel = 'linear', random_state = 0)
'''
SVM = Pipeline([('clf', SVC(random_state=1,C=10.0,probability=True))])

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

param_grid = [{'clf__C': param_range,
               'clf__kernel': ['linear']},
              {'clf__C': param_range,
               'clf__gamma': param_range,
               'clf__kernel': ['rbf']}]

gs = GridSearchCV(estimator=SVM,
                  param_grid=param_grid,
                  scoring='f1_macro',
                  cv=10,
                  n_jobs=1)
gs = gs.fit(X_train, y_train)

print('--> Best score: ',gs.best_score_)
print('--> Best parameters: \n',gs.best_params_)


#Select best parameters

SVM = gs.best_estimator_
'''
# fit the final model on the training split
SVM.fit(X_train, y_train)

# predict labels on validation set
predictions_SVM = SVM.predict(X_test)

# Use f1 score function 
print("F1 Score -> ",f1_score(y_test, predictions_SVM, average='macro'))


F1 Score ->  0.9728755687569889


In [9]:
# AUC metric (computed here from hard label predictions, not from decision scores)
fpr, tpr, thresholds = roc_curve(y_test, predictions_SVM, pos_label=True)
print("AUC Score -> ", auc(fpr, tpr))

AUC Score ->  0.9718490983664321
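
Because `roc_curve` receives hard label predictions here, the ROC curve has a single operating point, and its area equals the mean of the true-positive and true-negative rates (the balanced accuracy). A ranking-based AUC needs continuous scores, for example from `decision_function`. The sketch below illustrates the difference on synthetic data; the data and the names `X_demo`, `y_demo`, and `clf_demo` are assumptions for illustration and are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary problem standing in for the TF-IDF features.
X_demo, y_demo = make_classification(n_samples=400, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

clf_demo = SVC(kernel="linear", random_state=0).fit(X_tr, y_tr)
hard_labels = clf_demo.predict(X_te)
scores = clf_demo.decision_function(X_te)  # signed distance to the hyperplane

auc_from_labels = roc_auc_score(y_te, hard_labels)  # single-threshold "AUC"
auc_from_scores = roc_auc_score(y_te, scores)       # ranking-based AUC
bal_acc = balanced_accuracy_score(y_te, hard_labels)

print("AUC from labels  :", auc_from_labels)
print("Balanced accuracy:", bal_acc)
print("AUC from scores  :", auc_from_scores)
```

The first two printed values coincide, while the score-based AUC reflects how well the model ranks samples across all thresholds.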


In [10]:
# Confusion matrix
cm5 = confusion_matrix(y_test, predictions_SVM)
print(cm5)

[[6255   91]
 [ 219 5000]]
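
The printed matrix is enough to recompute the reported metrics by hand. The sketch below assumes scikit-learn's convention (rows are true labels, columns are predicted labels, classes in sorted order, so `False` first) and derives the per-class F1 scores, their macro average, and the mean recall. The helper name `f1_for_class` is hypothetical.

```python
# Confusion matrix copied from the output above:
# rows = true label, columns = predicted label, class order [False, True].
cm = [[6255, 91],
      [219, 5000]]


def f1_for_class(matrix, k):
    """F1 of class k, treating k as the positive class."""
    tp = matrix[k][k]
    fp = sum(row[k] for row in matrix) - tp  # predicted k, but true label differs
    fn = sum(matrix[k]) - tp                 # true k, but predicted otherwise
    return 2 * tp / (2 * tp + fp + fn)


per_class_f1 = [f1_for_class(cm, k) for k in range(len(cm))]
macro_f1 = sum(per_class_f1) / len(per_class_f1)

# Recall per class; their mean is the balanced accuracy.
recalls = [cm[k][k] / sum(cm[k]) for k in range(len(cm))]
balanced_accuracy = sum(recalls) / len(recalls)

print("Per-class F1     :", per_class_f1)
print("Macro F1         :", macro_f1)
print("Balanced accuracy:", balanced_accuracy)
```

The macro F1 agrees with the F1 score printed earlier, and the balanced accuracy agrees with the printed AUC, which is expected when the AUC is computed from hard labels.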


In [11]:
# store the trained model
with open('svm.sav', 'wb') as f:
    pickle.dump(SVM, f)

#### Model testing

In [12]:
# TEST DATA
# fake_data=pd.read_csv("data/Fake.csv")
# true_data=pd.read_csv("data/True.csv")
# fake_data['label'] = False
# true_data['label'] = True
# test_df = fake_data.append(true_data)

# test_df=pd.read_csv("data/snoopes.csv")
# test_df=pd.read_csv("data/test.csv")

# test_df = pd.DataFrame(columns=['title', 'text', 'label'])
# obj = {}
# obj['title']="For vaccine rates among Americans 65 and older, “there’s virtually no difference between white, Black, Hispanic, Asian American.”"
# obj['text']="During May 3 remarks on the American Families Plan, President Joe Biden boasted that there was not much disparity in the Covid-19 vaccination rates for white Americans and Americans of color who are at least 65.\n\n\"And what’s happening now is all the talk about how people were not going to get shots, they were not going to be involved — look at what that was — we were told that was most likely to be among people over 65 years of age,\" said Biden. \"But now people over 65 years of age, over 80%, have now been vaccinated, and 66% fully vaccinated. And there’s virtually no difference between white, Black, Hispanic, Asian American.\"\n\nThis isn’t the only time that Biden has made the claim.\n\nHe went even further on April 27 during remarks on the Covid-19 response: \"And, by the way, based on reported data, the proportion — the proportion of seniors who have been vaccinated is essentially equal between white and seniors of color. … As a matter of a fact, if I’m not mistaken, there are more Latinos and African American seniors that have been vaccinated, as a percentage, than white seniors.\"\n\nHowever, the national data that Biden keeps touting — vaccination statistics regarding both race and age — is not public. We asked the White House for the information underlying this claim, but officials did not provide specifics.\n\nSo, we moved on to the Centers for Disease Control and Prevention. Spokesperson Chandra Zeikel told KHN-PolitiFact on May 6 that \"unfortunately, we don’t have available a data breakdown of both racial demographics and age together.\" Zeikel didn’t respond to a follow-up question asking when or if the CDC would be publishing this data, but current CDC vaccination data is broken down only by race/ethnicity and shows significant differences, with white Americans far outpacing the percentage of other groups getting a shot. 
# It also shows that the rate of vaccinations among some groups, including Black and Latino Americans, does not match their share of the population, though new CDC data shows there has been some progress on this front in the last two weeks.\n\nThat made us wonder about the premise of Biden’s statement. We turned to experts for their take.\n\n\"As far as I know, there is no comprehensive publicly available data on vaccination rates by race/ethnicity and age,\" Samantha Artiga, vice president and director of the racial equity and health policy program at KFF, wrote in an email. \"As such, we are not able to assess whether there are racial disparities in vaccinations among people over 65 years of age.\"\n\nWhat about other state-level data or anecdotes that might support Biden’s claim? Let’s dive in and see."
# obj['label']=False
# test_df = test_df.append(obj, ignore_index=True)

test_df['label'].value_counts()

test_df["text_0"]=test_df["title"]+test_df["text"]

processed_test_df=PreprocessText(test_df)
X_t=processed_test_df.text_0
X_t=Tfidf_transform(X_t)

with open('svm.sav', 'rb') as f:
    svm_model = pickle.load(f)

predicted_cat=svm_model.predict(X_t)
test_df["pred_label"]=predicted_cat

print("F1 Score -> ",f1_score(test_df["label"], test_df["pred_label"], average='macro'))
confusion_matrix(test_df["label"],test_df["pred_label"])

-----Text Preprocessing Started-----
 ----Removing Stopwords---- 
 ----Removing Diatrics---- 
 ----Removing Punctuation---- 
-----Text Preprocessing Finished-----
-----Tfidf transforming-----
F1 Score ->  0.9123021316846135


array([[173,  35],
       [  0, 192]])

In [10]:
predictions=prob_classes

#### Prepare submission

In [11]:
# edit the following code to generate a submission file
submission = pd.DataFrame()
submission['id'] = test_df['id']
submission['category_0'] = predictions[:, 0]
submission['category_1'] = predictions[:, 1]
submission['category_2'] = predictions[:, 2]
submission['category_3'] = predictions[:, 3]
submission.to_csv('submission.csv', index=False)
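
Note that `prob_classes` is not defined in the cells shown, and the model trained above is binary, while the submission expects one probability column per class for four classes. A minimal sketch of how such a probability matrix could be produced is given below. It uses synthetic four-class data in place of the TF-IDF features and wraps a `LinearSVC` in `CalibratedClassifierCV` to obtain `predict_proba`; this is an assumed substitute and not the model from this notebook. The names `X_demo`, `y_demo`, and `submission_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic four-class data standing in for vectorized text (illustration only).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=6, n_classes=4, random_state=0
)
X_fit, y_fit = X_demo[:300], y_demo[:300]
X_new = X_demo[300:]

# LinearSVC has no predict_proba; the calibration wrapper adds one.
clf = CalibratedClassifierCV(LinearSVC(max_iter=5000), cv=3)
clf.fit(X_fit, y_fit)

# Shape (n_samples, 4); column k corresponds to clf.classes_[k].
proba = clf.predict_proba(X_new)

submission_demo = pd.DataFrame({"id": np.arange(len(X_new))})
for k, cls in enumerate(clf.classes_):
    submission_demo[f"category_{cls}"] = proba[:, k]

print(submission_demo.head())
```

Each row of `proba` sums to one, so the four `category_*` columns can be written to `submission.csv` as they are.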

# Conclusions

- Write a few words about your solution here. 

I developed a text classification pipeline that fits a TF-IDF vectorizer to the text and then trains a linear SVM on the resulting features. Text preprocessing is crucial for good results. I did not apply stemming or lemmatization because the text quality is already good and these steps can sometimes be counterproductive.

- What could be improved? 

Hyperparameter tuning could be done over a wider range of values and on the whole dataset. Bayesian optimization could be used instead of grid search.
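
Bayesian optimization requires an additional library, so as a lighter stand-in the sketch below uses scikit-learn's `RandomizedSearchCV`, which samples `C` from a log-uniform distribution instead of a fixed grid. This is a different technique from Bayesian optimization, shown only to illustrate searching a wider, continuous range. The synthetic data and the chosen bounds are assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic binary data standing in for the TF-IDF features (illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

search = RandomizedSearchCV(
    SVC(kernel="linear", random_state=0),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # continuous, log-scaled range
    n_iter=10,            # number of sampled configurations
    scoring="f1_macro",   # same metric as the notebook's grid search
    cv=5,
    random_state=0,
)
search.fit(X_demo, y_demo)

print("Best C       :", search.best_params_["C"])
print("Best macro F1:", search.best_score_)
```

With the same budget of fits, random sampling covers more distinct values of `C` than a grid, which tends to help when only one or two hyperparameters matter.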

- What approaches may work as well for this problem? 

Other classifiers such as convolutional neural networks, Random Forest, or Naive Bayes, and other text representations such as bag of words or word embeddings.

- What would you implement if you had more time for this task?

A CNN

- Feel free to write anything you think is relevant to this task :)

I applied grid search for parameter tuning on a subset of the data (because of time and memory constraints) to select the parameter 'C'.