# Support Vector Machines (SVM)
**Support vector machines (SVMs)** are a family of supervised learning methods used for classification, regression, and outlier detection.

Support vector machines have the following advantages:

- They are effective in high-dimensional spaces.
- They remain effective when the number of dimensions exceeds the number of samples.
- They use only a subset of the training points (the support vectors) in the decision function, so they are memory efficient.
- They are versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but custom kernels can also be specified.
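
The last two points can be illustrated with a minimal sketch. It assumes scikit-learn's `SVC` and a synthetic dataset from `make_classification`; the data, the value of `C`, and the helper `my_linear_kernel` are illustrative choices, not part of the experiment in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data (an assumption for illustration): 200 samples, 20 features
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Built-in linear kernel
clf_linear = SVC(kernel="linear", C=1.0).fit(X, y)

def my_linear_kernel(A, B):
    """Hypothetical custom kernel: the plain dot product between rows of A and B."""
    return A @ B.T

# A callable can be passed as the kernel; it must return the Gram matrix
clf_custom = SVC(kernel=my_linear_kernel, C=1.0).fit(X, y)

# Only the support vectors are kept for the decision function
n_sv = clf_linear.support_vectors_.shape[0]
print(f"{n_sv} of {X.shape[0]} training points are support vectors")

# Fraction of training points on which both models agree
agreement = np.mean(clf_linear.predict(X) == clf_custom.predict(X))
print("agreement between built-in and custom kernel:", agreement)
```

Since the custom kernel computes the same dot product as `kernel="linear"`, the two models should agree on nearly every point.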


Support vector machines have the following disadvantages:

- If the number of features is much greater than the number of samples, avoiding overfitting through the choice of kernel function and regularization term is crucial.
- SVMs do not provide probability estimates directly. In scikit-learn these are computed with an expensive five-fold cross-validation.
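
As a rough illustration of the second point, the sketch below wraps a `LinearSVC` in scikit-learn's `CalibratedClassifierCV` with five folds to obtain probabilities. The synthetic data and parameter values are assumptions for the example; this is one possible way to get calibrated outputs, not a step used elsewhere in this notebook.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic data (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# A plain LinearSVC only exposes signed distances to the hyperplane
svm = LinearSVC(C=1.0, dual=False).fit(X, y)
scores = svm.decision_function(X[:3])
print("decision scores:", scores)

# Calibrating on 5 cross-validation folds maps scores to probabilities.
# Each fold refits the SVM, which is where the extra cost comes from.
calibrated = CalibratedClassifierCV(LinearSVC(C=1.0, dual=False), cv=5)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X[:3])
print("calibrated probabilities:\n", proba)
```

For kernel SVMs, `SVC(probability=True)` turns on a comparable internal cross-validated calibration at fit time.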

In [1]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np


In [3]:
import nltk
#nltk.download('punkt')
from nltk.tokenize import TweetTokenizer
def get_texts_from_file(path_corpus, path_truth):
    """
    Reads a corpus and its corresponding labels from files.
    Args:
        path_corpus (str): Path to the corpus file (.txt), one tweet per line.
        path_truth (str): Path to the labels file (.txt), one integer label per line.
    Returns:
        tr_txt (list): List of tweets from the corpus.
        tr_y (list): List of integer labels corresponding to the tweets.
    """
    tr_txt = []
    tr_y = []
    with open(path_corpus, 'r') as f_corpus, open(path_truth, 'r') as f_truth:
        for tweet in f_corpus:
            tr_txt.append(tweet)
        for label in f_truth:
            tr_y.append(int(label))
    return tr_txt, tr_y
tr_txt, tr_y = get_texts_from_file('./mex20_train.txt', './mex20_train_labels.txt')
tokenizer=TweetTokenizer()


In [None]:
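def sortFreqDict(freqdict):
    """Returns (frequency, token) pairs sorted from most to least frequent."""
    return sorted(((freqdict[key], key) for key in freqdict), reverse=True)

# Tokenize every tweet and flatten the result into a single list of tokens
corpus_palabras = [token for doc in tr_txt for token in tokenizer.tokenize(doc)]

fdist = nltk.FreqDist(corpus_palabras)
V = sortFreqDict(fdist)
V1 = V[:5000]  # keep the 5000 most frequent tokens as the vocabulary
# Map each vocabulary token to its column index in the BoW matrix
dict_indices1 = {word: idx for idx, (_, word) in enumerate(V1)}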


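def build_bow_tr(tr_txt, V, dic_indices, mode="binary"):
    """Builds a document-term matrix: binary presence or relative frequency per token."""
    # Relative frequencies are fractions, so "freq" mode needs a float matrix;
    # an int matrix would truncate every value to 0
    BoW = np.zeros((len(tr_txt), len(V)), dtype=int if mode == "binary" else float)
    for i, doc in enumerate(tr_txt):
        fdist = nltk.FreqDist(tokenizer.tokenize(doc))
        total_words = sum(fdist.values())
        for word, freq in fdist.items():
            if word in dic_indices:
                index = dic_indices[word]
                if mode == "binary":
                    BoW[i, index] = 1
                elif mode == "freq":
                    BoW[i, index] = freq / total_words
    return BoW
BoW_tr = build_bow_tr(tr_txt, V1, dict_indices1)
BoW_tr.shape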

(5278, 5000)
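
To make the vocabulary and bag-of-words steps concrete, here is a small sketch on a toy corpus. The two documents, the vocabulary size of 3, and the helper `toy_bow` are hypothetical, and a whitespace split stands in for `TweetTokenizer` so the example has no NLTK dependency.

```python
from collections import Counter
import numpy as np

# Hypothetical toy corpus; whitespace split is an assumption replacing TweetTokenizer
docs = ["hola hola mundo", "adios mundo cruel"]
tokenized = [doc.split() for doc in docs]

# Rank tokens by (frequency, token) in descending order, as sortFreqDict does
counts = Counter(tok for doc in tokenized for tok in doc)
ranked = sorted(((freq, tok) for tok, freq in counts.items()), reverse=True)
vocab = ranked[:3]  # keep the 3 most frequent tokens
index = {tok: i for i, (_, tok) in enumerate(vocab)}

def toy_bow(tokenized_docs, index, mode="binary"):
    """Builds a small document-term matrix in binary or relative-frequency mode."""
    # Relative frequencies need a float matrix; ints would truncate them to 0
    bow = np.zeros((len(tokenized_docs), len(index)), dtype=int if mode == "binary" else float)
    for i, toks in enumerate(tokenized_docs):
        doc_counts = Counter(toks)
        for tok, freq in doc_counts.items():
            if tok in index:
                bow[i, index[tok]] = 1 if mode == "binary" else freq / len(toks)
    return bow

bow_binary = toy_bow(tokenized, index, mode="binary")
bow_freq = toy_bow(tokenized, index, mode="freq")
print(index)       # column assigned to each vocabulary token
print(bow_binary)  # 1 where the token occurs in the document
print(bow_freq)    # token count divided by document length
```

Tokens outside the vocabulary (here `adios`) are simply ignored, which is also what happens to validation tokens that never appeared among the top training tokens.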

In [5]:
val_txt,val_y=get_texts_from_file('./mex20_val.txt','./mex20_val_labels.txt')
BoW_val = build_bow_tr(val_txt, V1, dict_indices1)

In [6]:

# tr_y and val_y already hold ints: get_texts_from_file casts each label

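# Candidate values for the regularization parameter C
parameters = {'C': [0.12, 0.25, 0.5, 1, 2, 4]}
# class_weight='balanced' reweights classes inversely to their frequency;
# dual=False solves the primal problem instead of the dual
clf_svm = LinearSVC(class_weight='balanced', dual=False)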

# Grid search with cross-validation
grid = GridSearchCV(estimator=clf_svm,
                    param_grid=parameters,
                    scoring='f1_weighted', 
                    cv=5,
                    n_jobs=-1,  
                    verbose=1)


grid.fit(BoW_tr, tr_y)



Fitting 5 folds for each of 6 candidates, totalling 30 fits
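
Once the search has finished, the fitted `GridSearchCV` object exposes the selected parameters and the per-candidate scores. The sketch below shows how they could be inspected; it runs on synthetic data from `make_classification` (an assumption, since the tweet corpus is not bundled here), so its numbers say nothing about the results of this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Synthetic, mildly imbalanced data standing in for the BoW matrix (assumption)
X, y = make_classification(n_samples=300, n_features=30,
                           weights=[0.7, 0.3], random_state=0)

grid_demo = GridSearchCV(
    estimator=LinearSVC(class_weight="balanced", dual=False),
    param_grid={"C": [0.12, 0.25, 0.5, 1, 2, 4]},
    scoring="f1_weighted",
    cv=5,
)
grid_demo.fit(X, y)

# Best candidate and its mean cross-validated score
print("best C:", grid_demo.best_params_["C"])
print("best mean CV f1_weighted:", round(grid_demo.best_score_, 3))

# Mean and standard deviation of the score for every candidate
results = grid_demo.cv_results_
for c, mean, std in zip(results["param_C"],
                        results["mean_test_score"],
                        results["std_test_score"]):
    print(f"C={c}: {mean:.3f} +/- {std:.3f}")
```

By default `GridSearchCV` refits the best candidate on the full training data, which is why `best_estimator_` can be used directly for prediction.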


In [10]:

# Reload the validation split and vectorize it with the training vocabulary
val_txt, val_y = get_texts_from_file('./mex20_val.txt', './mex20_val_labels.txt')
BoW_val = build_bow_tr(val_txt, V1, dict_indices1)

best_svm = grid.best_estimator_
predicciones = best_svm.predict(BoW_val)


print("\nClassification report:")
print(classification_report(val_y, predicciones))


Classification report:
              precision    recall  f1-score   support

           0       0.87      0.86      0.86       418
           1       0.66      0.67      0.66       169

    accuracy                           0.80       587
   macro avg       0.76      0.77      0.76       587
weighted avg       0.81      0.80      0.80       587
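
The two average rows can be checked by hand from the per-class rows. The short sketch below recomputes the macro and weighted F1 from the rounded values printed in the report, so the results are only accurate to about two decimals.

```python
# Per-class (f1-score, support) pairs copied from the report above (rounded values)
per_class = {0: (0.86, 418), 1: (0.66, 169)}

total_support = sum(support for _, support in per_class.values())

# Macro average: unweighted mean over classes
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted average: mean over classes weighted by support
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

print("total support:", total_support)
print("macro f1:", round(macro_f1, 2))
print("weighted f1:", round(weighted_f1, 2))
```

The weighted average sits closer to the score of class 0 because that class accounts for 418 of the 587 validation samples.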

