<h1> Text Classification</h1>
There are three main types of classification:

>> 1) Binary: Two mutually exclusive categories (e.g., Spam detection)

>> 2) Multiclass: More than two mutually exclusive categories (e.g., Language detection)

>> 3) Multilabel: Non-mutually exclusive categories (e.g., movie genres)
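
>> As a rough illustration of how the three types differ in the shape of their labels, the sketch below uses scikit-learn's `type_of_target` helper. The label values are hypothetical and only serve to show the three cases.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

# Hypothetical label sets, one per classification type
binary_labels = ["spam", "ham", "spam", "ham"]
multiclass_labels = ["EN", "SP", "FR", "EN"]
# Multilabel targets as an indicator matrix: one column per genre,
# and a row may contain several 1s (or none)
multilabel_labels = np.array([[1, 0, 1],
                              [0, 1, 0],
                              [1, 1, 0]])

binary_kind = type_of_target(binary_labels)
multiclass_kind = type_of_target(multiclass_labels)
multilabel_kind = type_of_target(multilabel_labels)
print(binary_kind, multiclass_kind, multilabel_kind)
```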

<h1> Binary text classification problem</h1> 
>> We will address the binary problem of distinguishing sports-related documents from all other documents. To do this, we will create an artificial (and very small) collection and follow these steps:

>> 1) Define a set of labelled documents that will be our training dataset. These are the documents the classifier will learn from in order to categorise future unseen documents.

>> 2) Define a set of labelled documents that will be our testing dataset. These will be the "unseen" documents whose labels the classifier will predict (without having been trained on them).

>> 3) Represent our training and testing documents

>> 4) Train the classifier based on the training data

>> 5) Predict the labels for the testing documents
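
>> The five steps can also be sketched compactly with a scikit-learn `Pipeline`, which chains the representation and the classifier into one object. This is only one possible arrangement; the documents and the `toy_*` names below are hypothetical and are not the collection used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Steps 1 and 2: tiny hypothetical training and testing sets
toy_train_docs = ["Our team won the match", "The striker scored twice",
                  "Parliament passed a new law", "The election is next month"]
toy_train_labels = ["Sports", "Sports", "Non Sports", "Non Sports"]
toy_test_docs = ["The team scored again", "A new law on elections"]

# Steps 3 and 4: the pipeline vectorises the text and trains the classifier
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC())
pipeline.fit(toy_train_docs, toy_train_labels)

# Step 5: predict labels for the unseen documents
toy_predictions = pipeline.predict(toy_test_docs)
print(toy_predictions)
```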

In [0]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

In [0]:
# Train and test data. Both the full documents and their labels ("Sports" vs "Non Sports")
train_data = ['Football: a great sport', 'The referee has been very bad this season', 'Our team scored 5 goals', 'I love tennis',
              'Politics is in decline in the UK', 'Brexit means Brexit', 'The parlament wants to create new legislation',
              'I so want to travel the world']
train_labels = ["Sports","Sports","Sports","Sports", "Non Sports", "Non Sports", "Non Sports", "Non Sports"]

test_data = ['Swimming is a great sport', 
             'A lot of policy changes will happen after Brexit', 
             'The table tennis team will travel to the UK soon for the European Championship']
test_labels = ["Sports","Non Sports","Sports"]



In [0]:
print('feature {},label {}'.format(train_data[1],train_labels[1]))

feature The referee has been very bad this season,label Sports


In [0]:
# Representation of the data using TF-IDF
vectorizer = TfidfVectorizer()
vectorised_train_data = vectorizer.fit_transform(train_data)
vectorised_test_data = vectorizer.transform(test_data)

In [0]:
vectorised_train_data[1].toarray()

array([[0.3675562 , 0.3675562 , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.3675562 , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.3675562 , 0.        ,
        0.3675562 , 0.        , 0.        , 0.        , 0.        ,
        0.23306022, 0.3675562 , 0.        , 0.        , 0.        ,
        0.3675562 , 0.        , 0.        , 0.        ]])

In [0]:
# Train the classifier given the training data
classifier = LinearSVC()
classifier.fit(vectorised_train_data, train_labels)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)

In [0]:
# Predict the labels for the test documents (not used for training)
print(classifier.predict(vectorised_test_data))

['Sports' 'Non Sports' 'Non Sports']
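
>> The `test_labels` defined earlier have not been used so far. A short sketch of how the predictions could be scored against them is given below; the predicted labels are copied from the output above so that the block runs on its own.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

test_labels = ["Sports", "Non Sports", "Sports"]
# Predictions copied from the classifier output shown above
predicted_labels = ["Sports", "Non Sports", "Non Sports"]

accuracy = accuracy_score(test_labels, predicted_labels)
matrix = confusion_matrix(test_labels, predicted_labels,
                          labels=["Sports", "Non Sports"])

print(f"accuracy: {accuracy:.2f}")
print(matrix)  # rows are true labels, columns are predicted labels
```

>> Two of the three test documents are classified correctly; the table tennis document is the one that is missed.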


<h1> Naive Bayes</h1>

In [0]:
from sklearn.naive_bayes import GaussianNB

In [0]:
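# GaussianNB does not accept sparse matrices, hence the .toarray() calls
model = GaussianNB()
model.fit(vectorised_train_data.toarray(), train_labels)

# Predict the labels for the test documents
predicted = model.predict(vectorised_test_data.toarray())
print(predicted)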


['Sports' 'Non Sports' 'Non Sports']
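
>> `GaussianNB` models each feature as a continuous Gaussian variable. For term-weight features, `MultinomialNB` is a commonly used alternative that also accepts the sparse matrix directly. The sketch below swaps it in as an assumption for comparison; it is not the model used above, and its predictions on such a small dataset may or may not differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_data = ['Football: a great sport', 'The referee has been very bad this season',
              'Our team scored 5 goals', 'I love tennis',
              'Politics is in decline in the UK', 'Brexit means Brexit',
              'The parlament wants to create new legislation',
              'I so want to travel the world']
train_labels = ["Sports", "Sports", "Sports", "Sports",
                "Non Sports", "Non Sports", "Non Sports", "Non Sports"]
test_data = ['Swimming is a great sport',
             'A lot of policy changes will happen after Brexit',
             'The table tennis team will travel to the UK soon for the European Championship']

vectorizer = TfidfVectorizer()
vectorised_train_data = vectorizer.fit_transform(train_data)
vectorised_test_data = vectorizer.transform(test_data)

# MultinomialNB works on the sparse TF-IDF matrix, so no .toarray() is needed
nb_model = MultinomialNB()
nb_model.fit(vectorised_train_data, train_labels)
nb_predicted = nb_model.predict(vectorised_test_data)
print(nb_predicted)
```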


<h1> Let's try again, with stop-word removal this time</h1>

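>> On this small dataset, removing stop words improves the predictions: as the output of the cell below shows, the third test document is now correctly classified as Sports.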

In [0]:
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # the stop-word list has to be downloaded once

def text_processing_without_stopwords():
  # Same steps as before, but English stop words are dropped by the vectoriser
  stop_words = stopwords.words("english")
  vectorizer = TfidfVectorizer(stop_words=stop_words)
  vectorised_train_data = vectorizer.fit_transform(train_data)
  vectorised_test_data = vectorizer.transform(test_data)
  classifier = LinearSVC()
  classifier.fit(vectorised_train_data, train_labels)
  print(classifier.predict(vectorised_test_data))

In [0]:
text_processing_without_stopwords()

['Sports' 'Non Sports' 'Sports']
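
>> One plausible explanation for the change is that words such as "the" and "to" occur mostly in the Non Sports training documents, so they pull the third test document towards the wrong class until they are removed. The sketch below shows which tokens of that document survive stop-word removal. As an assumption, it uses scikit-learn's built-in English list (`stop_words="english"`) instead of the NLTK list, so the exact tokens kept may differ slightly from the run above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

document = ('The table tennis team will travel to the UK soon '
            'for the European Championship')

# build_analyzer() returns the function the vectoriser uses to turn text into tokens
all_tokens = TfidfVectorizer().build_analyzer()(document)

# Assumption: scikit-learn's built-in English list stands in for the NLTK list
kept_tokens = TfidfVectorizer(stop_words="english").build_analyzer()(document)

print(all_tokens)
print(kept_tokens)
```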


<h1> Multiclass classification</h1>

>> We will address the multiclass problem of detecting the language of a sentence, choosing among three mutually exclusive languages (Spanish, English and French). For the sake of this example, we assume those are the only three languages the documents can be written in. As before, we will create an artificial (and very small) collection and follow similar steps.

In [0]:
def multiClass_prediction(train_data, train_labels, test_data):
  # Note: no stop-word removal here. Stop words can be beneficial in this problem,
  # since they are strong indicators of the language.
  vectorizer = TfidfVectorizer()
  vectorised_train_data = vectorizer.fit_transform(train_data)
  vectorised_test_data = vectorizer.transform(test_data)

  classifier = LinearSVC()
  classifier.fit(vectorised_train_data, train_labels)
  predictions = classifier.predict(vectorised_test_data)
  print(predictions)

In [0]:
train_data = ['PyCon es una gran conferencia', 'Aprendizaje automatico esta listo para dominar el mundo dentro de poco',
             'This is a great conference with a lot of amazing talks', 'AI will dominate the world in the near future',
             'Dix chiffres por resumer le feuilleton de la loi travail']
train_labels = ["SP", "SP", "EN", "EN", "FR"]

test_data = ['Estoy preparandome para dominar las olimpiadas', 'Me gustaria mucho aprender el lenguage de programacion Scala',
             'Machine Learning is amazing','Hola a todos']
test_labels = ["SP", "SP", "EN", "SP"]

multiClass_prediction(train_data,train_labels,test_data)

['SP' 'SP' 'EN' 'EN']


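>> The last prediction is incorrect: "Hola a todos" should be "SP" but is predicted as "EN". Neither "hola" nor "todos" occurs in the training data, and the default tokeniser of TfidfVectorizer drops single-character tokens such as "a", so this document is represented by an all-zero vector and the classifier has no evidence to work with.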
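
>> The sketch below checks how much of "Hola a todos" the word-level representation actually captures, and compares it with a character n-gram representation, which is a common choice for language identification. The character n-gram settings (`analyzer="char_wb"`, `ngram_range=(1, 3)`) are an assumption made for illustration; whether they fix the prediction on such a small training set is not guaranteed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_data = ['PyCon es una gran conferencia',
              'Aprendizaje automatico esta listo para dominar el mundo dentro de poco',
              'This is a great conference with a lot of amazing talks',
              'AI will dominate the world in the near future',
              'Dix chiffres por resumer le feuilleton de la loi travail']
unseen = ['Hola a todos']

# Word-level features: none of the words were seen during training
word_vectorizer = TfidfVectorizer()
word_vectorizer.fit(train_data)
word_features = word_vectorizer.transform(unseen)
print("non-zero word features:", word_features.nnz)

# Assumption: character n-grams (1 to 3 characters, within word boundaries)
char_vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))
char_vectorizer.fit(train_data)
char_features = char_vectorizer.transform(unseen)
print("non-zero character features:", char_features.nnz)
```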


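<h1> Multilabel classification</h1>

>> We will address the multilabel problem of tagging documents with Sports and/or Politics. The categories are not mutually exclusive: a document can carry both labels or neither.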

In [0]:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.multiclass import OneVsRestClassifier

In [0]:
# Artificial (and small) dataset. Sports and Politics

train_data = ['Football: a great sport', 'The referee has been very bad this season', 'Our team scored 5 goals', 'I love tennis',
              'Politics is in decline in the UK', 'Brexit means Brexit', 'The parlament wants to create new legislation',
              'I so want to travel the world', 
              'The goverment will increase the budget for sports in the UK after the victories in the Olimpic Games',
              "O'Reilly has a great conference this year"]

train_labels = [["Sports"], ["Sports"], ["Sports"], ["Sports"],["Politics"],["Politics"],["Politics"],[],["Politics", "Sports"],[]]

test_data = ['Swimming is a great sport', 
             'A lot of policy changes will happen after Brexit', 
             'The table tennis team will travel to the UK soon for the European Championship',
             'The goverment will increase the budget for sports in the UK after the victories in the Olimpic Games',
             'PyCon is my favourite conference'] 

test_labels = [["Sports"], ["Politics"], ["Sports"], ["Politics","Sports"],[]] 

# Convert the label lists into a binary indicator matrix (one column per label)
mlb = MultiLabelBinarizer()
binary_train_labels = mlb.fit_transform(train_labels)
binary_test_labels = mlb.transform(test_labels)

print(binary_train_labels)

[[0 1]
 [0 1]
 [0 1]
 [0 1]
 [1 0]
 [1 0]
 [1 0]
 [0 0]
 [1 1]
 [0 0]]
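
>> The columns of the indicator matrix follow the sorted label names, which `MultiLabelBinarizer` exposes through its `classes_` attribute. A small sketch with a shortened, hypothetical label list:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Shortened, hypothetical label list covering the four possible cases
example_labels = [["Sports"], ["Politics"], ["Politics", "Sports"], []]

mlb = MultiLabelBinarizer()
binary_labels = mlb.fit_transform(example_labels)

column_names = list(mlb.classes_)  # column order of the indicator matrix
print(column_names)
print(binary_labels)
```

>> The first column therefore stands for Politics and the second for Sports; a document without labels becomes a row of zeros.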


In [0]:
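# Represent the documents with TF-IDF, removing English stop words.
# stop_words was local to the earlier function, so it is defined again here.
stop_words = stopwords.words("english")
vectorizer = TfidfVectorizer(stop_words=stop_words)
vectorised_train_data = vectorizer.fit_transform(train_data)
vectorised_test_data = vectorizer.transform(test_data)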

# One classifier is built per category using a one-vs-rest approach
classifier = OneVsRestClassifier(LinearSVC())
classifier.fit(vectorised_train_data, binary_train_labels)

# Predict the label sets for the test documents
predictions = classifier.predict(vectorised_test_data)

print(predictions)
print()

print(mlb.inverse_transform(predictions))

[[0 1]
 [1 0]
 [0 1]
 [1 1]
 [0 0]]

[('Sports',), ('Politics',), ('Sports',), ('Politics', 'Sports'), ()]
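
>> The predictions can be compared with `test_labels` using multilabel metrics. The sketch below computes exact-match accuracy and Hamming loss; the prediction matrix is copied from the output above so that the block is self-contained.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss
from sklearn.preprocessing import MultiLabelBinarizer

test_labels = [["Sports"], ["Politics"], ["Sports"], ["Politics", "Sports"], []]

# Fix the column order to [Politics, Sports], matching the matrices above
mlb = MultiLabelBinarizer(classes=["Politics", "Sports"])
binary_test_labels = mlb.fit_transform(test_labels)

# Predictions copied from the classifier output shown above
predictions = np.array([[0, 1], [1, 0], [0, 1], [1, 1], [0, 0]])

# For multilabel data, accuracy_score counts a row as correct only if every label matches
exact_match = accuracy_score(binary_test_labels, predictions)
# Hamming loss is the fraction of individual label decisions that are wrong
label_errors = hamming_loss(binary_test_labels, predictions)

print("exact-match accuracy:", exact_match)
print("Hamming loss:", label_errors)
```

>> All five label sets are predicted correctly here. Note that the fourth test document also appears verbatim in the training data, so this result is optimistic.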
