# NLP Project Tutorial

**Task objective:** Build a spam detector for URLs using NLP

**Step 1:** Import and transform the data

First, run in the console: `pip install -r requirements.txt`

In [72]:
# even after installing the requirements, these had to be installed explicitly for the notebook to work
! pip install pandas
! pip install scikit-learn


In [73]:
!pip install -r ../requirements.txt


In [74]:
# import libraries
import pandas as pd
import pickle
import numpy as np
import re
import unicodedata
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import confusion_matrix




Load Data

In [75]:
url = "https://raw.githubusercontent.com/4GeeksAcademy/NLP-project-tutorial/main/url_spam.csv"
df_raw = pd.read_csv(url)

In [76]:
df_raw.head()

Unnamed: 0,url,is_spam
0,https://briefingday.us8.list-manage.com/unsubs...,True
1,https://www.hvper.com/,True
2,https://briefingday.com/m/v4n3i4f3,True
3,https://briefingday.com/n/20200618/m#commentform,False
4,https://briefingday.com/fan,True


In [77]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2999 entries, 0 to 2998
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   url      2999 non-null   object
 1   is_spam  2999 non-null   bool  
dtypes: bool(1), object(1)
memory usage: 26.5+ KB


The dataset contains 2999 rows (URLs) and 2 columns: the URL and whether or not it is spam.

In [78]:
df_raw['is_spam'].value_counts()

False    2303
True      696
Name: is_spam, dtype: int64

In [79]:
# Check duplicates
print('Number of duplicated rows:', df_raw.duplicated().sum())
df_raw = df_raw.drop_duplicates().reset_index(drop = True)
df_raw['is_spam'].value_counts()
# Note: removing duplicates leaves the classes more imbalanced:
# 452 spam rows were dropped, but only 178 non-spam rows


Number of duplicated rows: 630


False    2125
True      244
Name: is_spam, dtype: int64
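The shift in class balance can be checked from the two `value_counts()` outputs shown above. A minimal sketch in plain Python, using only those printed counts (the dictionary keys `not_spam` and `spam` are names chosen here for illustration):

```python
# Class counts copied from the value_counts() outputs before and after
# dropping duplicates.
before = {"not_spam": 2303, "spam": 696}
after = {"not_spam": 2125, "spam": 244}


def spam_share(counts):
    """Return the fraction of rows labelled as spam."""
    return counts["spam"] / (counts["spam"] + counts["not_spam"])


# Rows removed per class and in total.
removed = {label: before[label] - after[label] for label in before}
total_removed = sum(removed.values())

print("removed per class:", removed)
print("total removed:", total_removed)
print(f"spam share before: {spam_share(before):.3f}")
print(f"spam share after:  {spam_share(after):.3f}")
```

The total matches the 630 duplicated rows reported, and the spam share falls from roughly 23% to roughly 10%, which is worth keeping in mind when reading the per-class metrics later on.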


PREPROCESS

In [80]:
df = df_raw.copy()

In [81]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2369 entries, 0 to 2368
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   url      2369 non-null   object
 1   is_spam  2369 non-null   bool  
dtypes: bool(1), object(1)
memory usage: 20.9+ KB


In [82]:
def clean_data(urlData):
    # replace punctuation, digits, and symbols with spaces
    urlData = re.sub('[^a-zA-Z]', ' ', urlData)

    # drop single-letter tokens (\b marks a word boundary)
    urlData = re.sub(r'\b[a-zA-Z]\b', ' ', urlData)

    # collapse repeated whitespace, including the gaps left by the step above
    urlData = re.sub(r'\s+', ' ', urlData)

    # strip leading and trailing whitespace (spaces and tabs)
    urlData = urlData.strip()
    return urlData


df['url'] = df['url'].str.lower()
# clean the data
df['url'] = df['url'].apply(clean_data)

# function to remove stopwords
stopWord = ['is','you','your','and', 'the', 'to', 'from', 'or', 'I', 'for', 'do', 'get', 'not', 'here', 'in', 'im', 'have', 'on',
're', 'https', 'com', 'of']

def remove_stopwords(urlData):
    if urlData is not None:
        words = urlData.strip().split()
        words_filtered = []
        for word in words:
            if word not in stopWord:
                words_filtered.append(word)
        result = " ".join(words_filtered)  # join the remaining words with single spaces
    else:
        result = None
    return result

df['url'] = df['url'].apply(remove_stopwords)
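To see what the cleaning steps above do to a single URL, here is a compact, standalone restatement for one string. It is only a sketch: `clean_url` and `STOP_WORDS` are names introduced here, and the sample URLs are rows shown in the `df_raw.head()` output.

```python
import re

# Same stopwords as the list above, stored as a set for fast lookup.
STOP_WORDS = {"is", "you", "your", "and", "the", "to", "from", "or", "I",
              "for", "do", "get", "not", "here", "in", "im", "have", "on",
              "re", "https", "com", "of"}


def clean_url(raw):
    """Lowercase, keep letters only, drop single letters and stopwords."""
    text = raw.lower()
    text = re.sub(r"[^a-z]", " ", text)      # non-letters become spaces
    text = re.sub(r"\b[a-z]\b", " ", text)   # remove single-letter tokens
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)


for sample in ("https://briefingday.com/n/20200618/m#commentform",
               "https://www.hvper.com/"):
    print(sample, "->", clean_url(sample))
```

The two samples reduce to `briefingday commentform` and `www hvper`. Note that `www` survives because it is not in the stopword list.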

In [83]:
# helper functions

def comas(text):
    """
    Replace commas with spaces
    """
    return re.sub(',', ' ', text)

def espacios(text):
    """
    Collapse two or more consecutive newlines into a single newline
    """
    return re.sub(r'(\n{2,})','\n', text)

def minuscula(text):
    """
    Convert uppercase to lowercase
    """
    return text.lower()

def numeros(text):
    """
    Replace runs of digits with a space
    """
    return re.sub(r'\d+', ' ', text)

def caracteres_no_alfanumericos(text):
    """
    Replace runs of non-word characters (anything that is not a letter, digit, or underscore) with a space
    Ex. hola 'pepito' como le va? -> hola pepito como le va
    """
    return re.sub(r"(\W)+", " ", text)

def comillas(text):
    """
    Replace single quotes with a space
    Ex. hola 'pepito' como le va? -> hola pepito como le va?
    """
    return re.sub("'"," ", text)

def palabras_repetidas(text):
    """
    Collapse consecutive repeated words into one

    Ex. hola hola, como les va? a a ustedes -> hola, como les va? a ustedes
    """
    return re.sub(r'\b(\w+)( \1\b)+', r'\1', text)

def esp_multiple(text):
    """
    Collapse multiple spaces between words into a single space
    """
    return re.sub(' +', ' ',text)



#df['texto_limpio'] = df['texto'].apply(espacios).apply(comas).apply(url).apply(minuscula).apply(esp_multiple).apply(comillas)

#df['texto_limpio'].values[:]
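Each helper above takes a string and returns a string, so they can be chained in any order. A small sketch of applying three of them in sequence to one sentence. The definitions are repeated so the block runs on its own, and `steps` and `apply_all` are hypothetical names introduced for illustration.

```python
import re
from functools import reduce


def caracteres_no_alfanumericos(text):
    return re.sub(r"(\W)+", " ", text)


def palabras_repetidas(text):
    return re.sub(r'\b(\w+)( \1\b)+', r'\1', text)


def esp_multiple(text):
    return re.sub(' +', ' ', text)


# Order matters: punctuation is removed first so that "hola hola," is
# seen as two identical adjacent words by palabras_repetidas.
steps = (caracteres_no_alfanumericos, palabras_repetidas, esp_multiple)


def apply_all(text, functions=steps):
    """Feed the output of each function into the next one."""
    return reduce(lambda acc, fn: fn(acc), functions, text)


cleaned = apply_all("hola hola, como les va? a a ustedes")
print(cleaned)
```

With pandas, the same chain is what repeated `.apply(...)` calls on a column do, one function at a time.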




In [84]:
'/'
'.'
'  '


'  '

In [85]:
# function to strip the https prefix
def url(text):
    return re.sub(r'(https://www|https://)', '', text)

In [86]:
# clean the url
df['url_limpia'] = df['url'].apply(url).apply(caracteres_no_alfanumericos).apply(esp_multiple)

In [87]:
df.head()

Unnamed: 0,url,is_spam,url_limpia
0,briefingday us list manage unsubscribe,True,briefingday us list manage unsubscribe
1,www hvper,True,www hvper
2,briefingday,True,briefingday
3,briefingday commentform,False,briefingday commentform
4,briefingday fan,True,briefingday fan


In [88]:
df['is_spam'] = df['is_spam'].astype(int)

In [89]:
df.head()

Unnamed: 0,url,is_spam,url_limpia
0,briefingday us list manage unsubscribe,1,briefingday us list manage unsubscribe
1,www hvper,1,www hvper
2,briefingday,1,briefingday
3,briefingday commentform,0,briefingday commentform
4,briefingday fan,1,briefingday fan


**Step 2:** Use NLP techniques to preprocess the data

In [90]:
vec = CountVectorizer().fit_transform(df['url_limpia'])

In [91]:
X_train, X_test, y_train, y_test = train_test_split(vec, df['is_spam'], stratify = df['is_spam'], random_state = 2207)
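`CountVectorizer` turns each cleaned URL into a row of token counts over a shared vocabulary. The pure-Python sketch below illustrates that idea on four of the cleaned rows shown earlier. It is an illustration of the bag-of-words representation, not scikit-learn's implementation, and `to_counts` is a name introduced here.

```python
from collections import Counter

# Cleaned URLs taken from the df.head() output above.
docs = [
    "www hvper",
    "briefingday",
    "briefingday commentform",
    "briefingday fan",
]

# Vocabulary: every distinct token, sorted so the column order is stable.
vocabulary = sorted({token for doc in docs for token in doc.split()})


def to_counts(doc):
    """Return one row of token counts aligned with `vocabulary`."""
    counts = Counter(doc.split())
    return [counts.get(token, 0) for token in vocabulary]


matrix = [to_counts(doc) for doc in docs]

print(vocabulary)
for doc, row in zip(docs, matrix):
    print(row, "<-", doc)
```

Each column corresponds to one token, so a URL that shares no tokens with the vocabulary becomes an all-zero row. This is why the vocabulary used at prediction time has to be the same one used for training.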

**Step 3:** Use an SVM to build a URL spam classifier

In [92]:
classifier = SVC(C = 1.0, kernel = 'linear', gamma = 'auto')

In [93]:
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.95      0.96      0.96       532
           1       0.64      0.59      0.62        61

    accuracy                           0.92       593
   macro avg       0.80      0.78      0.79       593
weighted avg       0.92      0.92      0.92       593
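Each per-class row of the report is derived from three counts for that class: true positives, false positives, and false negatives. A short sketch of the arithmetic follows. The counts are hypothetical and are not taken from the run above.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the spam class (label 1).
tp, fp, fn = 30, 10, 20
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

On an imbalanced dataset like this one, overall accuracy is dominated by the majority class, so the recall of the spam class is usually the more informative number.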



In [94]:
# tune hyperparameters
param_grid = {'C': [0.1,1, 10, 100], 'gamma': [1,0.1,0.01,0.001],'kernel': ['rbf', 'poly', 'sigmoid']}

grid = GridSearchCV(SVC(random_state=1234),param_grid,verbose=2)
grid.fit(X_train,y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.2s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.2s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.2s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.2s
[CV] END .........................C=0.1, gamma=1, kernel=rbf; total time=   0.2s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=   0.2s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=   0.2s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=   0.2s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=   0.2s
[CV] END ........................C=0.1, gamma=1, kernel=poly; total time=   0.2s
[CV] END .....................C=0.1, gamma=1, kernel=sigmoid; total time=   0.0s
[CV] END .....................C=0.1, gamma=1, k
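The "48 candidates, totalling 240 fits" line in the log follows from the grid size: 4 values of `C`, 4 of `gamma`, and 3 kernels, each evaluated on 5 folds. A quick check with `itertools.product`, as a sketch that enumerates the combinations without running any training:

```python
from itertools import product

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1, 0.1, 0.01, 0.001],
    "kernel": ["rbf", "poly", "sigmoid"],
}
n_folds = 5  # the number of folds reported in the log

# Every combination of one value per parameter.
candidates = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]

print(len(candidates), "candidates")
print(len(candidates) * n_folds, "fits")
print("first candidate:", candidates[0])
```

Adding one more value to any parameter multiplies the work, so it can pay to start with a coarse grid and refine around the best region.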

In [95]:
grid.best_params_

{'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}

In [96]:
grid.best_estimator_

In [97]:
predictions = grid.best_estimator_.predict(X_test)
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.95      0.98      0.96       532
           1       0.76      0.52      0.62        61

    accuracy                           0.93       593
   macro avg       0.85      0.75      0.79       593
weighted avg       0.93      0.93      0.93       593



Test

In [98]:
#### MODEL

X = df['url']
y = df['is_spam']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state = 42)

# vectorizer
vec = CountVectorizer()

# build the count matrices; the vocabulary is fitted on the training split only
X_train = vec.fit_transform(X_train).toarray()
X_test = vec.transform(X_test).toarray()

# create the model using SVC (degree and gamma are ignored by the linear kernel)
svclassifier = SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
svclassifier.fit(X_train, y_train)

# save the model to file
filename = '../models/svc_model.sav'  # relative to the notebook; switch to an absolute path if needed
with open(filename, 'wb') as f:
    pickle.dump(svclassifier, f)
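A classifier saved on its own expects inputs vectorized with the same fitted vocabulary, so any code that loads it later (for example the app.py of Step 4) also needs the fitted `CountVectorizer`. One possible approach, sketched here as an assumption and not as the tutorial's method, is to bundle both in a scikit-learn pipeline and pickle that single object. The toy URLs and labels are hypothetical, and the bytes are kept in memory instead of being written to disk.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical cleaned URLs and labels, for illustration only.
urls = [
    "briefingday fan",
    "briefingday unsubscribe",
    "news article politics",
    "news article science",
]
labels = [1, 1, 0, 0]

# The pipeline vectorizes raw text and then classifies it.
model = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
model.fit(urls, labels)

# Round-trip through pickle, as app.py would do via a file.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

new_urls = ["briefingday fan", "news article science"]
print(restored.predict(new_urls))
```

The restored pipeline accepts cleaned text directly, so the loading code does not need to rebuild or refit the vocabulary.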

**Step 4:** Create app.py with the relevant code and document the procedure in README.md