## Feature Engineering

This notebook contains the feature engineering step for the training and test sets. It builds on the preprocessing done in the earlier Exploratory Analysis step.

At the end of this step, two processed datasets (training and test) are produced. They are used to train and test the classification models.

#### Importing libraries:

In [1]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize as TK

import nltk
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

#### Importing the datasets:

In [2]:
# read datasets
train = pd.read_csv('dataset/train.csv', encoding='utf-8')
test = pd.read_csv('dataset/test.csv', encoding='utf-8')

In [3]:
print("There are {} examples in the training dataset. Each example has {} attributes: "
      .format(train.shape[0], train.shape[1]) + ", ".join(train.columns) + '.')

There are 20800 examples in the training dataset. Each example has 5 attributes: id, title, author, text, label.


In [4]:
# show the first 5 news articles in the training set
train.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \r\nAn Iranian woman has been sentenced ...,1


In [5]:
print("There are {} examples in the test dataset. Each example has {} attributes: "
      .format(test.shape[0], test.shape[1]) + ", ".join(test.columns) + '.')

There are 5200 examples in the test dataset. Each example has 4 attributes: id, title, author, text.


In [6]:
# show the first 5 news articles in the test set
test.head()

Unnamed: 0,id,title,author,text
0,20800,"Specter of Trump Loosens Tongues, if Not Purse...",David Streitfeld,"PALO ALTO, Calif. — After years of scorning..."
1,20801,Russian warships ready to strike terrorists ne...,,Russian warships ready to strike terrorists ne...
2,20802,#NoDAPL: Native American Leaders Vow to Stay A...,Common Dreams,Videos #NoDAPL: Native American Leaders Vow to...
3,20803,"Tim Tebow Will Attempt Another Comeback, This ...",Daniel Victor,"If at first you don’t succeed, try a different..."
4,20804,Keiser Report: Meme Wars (E995),Truth Broadcast Network,42 mins ago 1 Views 0 Comments 0 Likes 'For th...


### Data processing
In this step, the training and test data are processed as follows:
* Handling missing values
* Converting all words to lowercase
* Removing numeric and special characters
* Removing stop words
* Lemmatizing the words

##### Handling missing values

In [7]:
# drop examples whose text is missing (.copy() avoids assigning to a view later on)
train = train[train['text'].notnull()].copy()
test = test[test['text'].notnull()].copy()

# fill missing titles with an empty string
train['title'] = train['title'].fillna('')
test['title'] = test['title'].fillna('')
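As a minimal sketch of the two rules above, the example below uses a made-up three-row frame (the rows are invented for illustration and are not taken from the dataset). Rows without `text` are dropped, while a missing `title` is kept and replaced by an empty string.

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "title": ["A headline", None, "Another headline"],
    "text": ["some body text", "body without a title", None],
})

# drop rows whose text is missing; .copy() avoids modifying a view of `toy`
clean = toy[toy["text"].notnull()].copy()

# a missing title is kept, but replaced by an empty string
clean["title"] = clean["title"].fillna("")

print(clean)
```

The asymmetry is deliberate: an article with no body carries nothing to classify, whereas an article with no title still has usable text.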

##### Converting to lowercase

In [8]:
train['processed_text'] = train['text'].str.lower()
train['processed_title'] = train['title'].str.lower()

test['processed_text'] = test['text'].str.lower()
test['processed_title'] = test['title'].str.lower()

##### Removing numbers and special characters
The text needs cleaning so that the same word written in different or incorrect ways is mapped to a single form, and so that tokens with no meaning on their own, such as numbers, are dropped. To do this we will:
- remove numeric characters
- remove special characters and punctuation
- remove accents, hyphens and apostrophes
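To make the effect of these rules concrete, here is a small sketch that applies them to a single string with the standard `re` module. The helper name `clean_text` and the sample sentence are illustrative assumptions; the notebook itself applies the same patterns column by column with `str.replace`.

```python
import re

def clean_text(s):
    """Apply the cleaning rules to one string (illustrative helper)."""
    s = s.lower()
    s = re.sub(r"[0-9]", "", s)               # numeric characters
    s = re.sub(r"[?_!%&/.,®€™():]", "", s)    # special characters and punctuation
    s = re.sub(r"[-“”\[—‘’']", " ", s)        # hyphens, quotes, apostrophes -> space
    return s

sample = "Comey’s 2 e-mails: (Oct. 28)"
print(clean_text(sample).split())
```

Hyphens and apostrophes are replaced by a space rather than deleted, so `e-mails` becomes two tokens instead of being fused into one.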

In [9]:
# remove numeric characters
train['processed_text'] = train['processed_text'].str.replace('[0-9]', '',regex=True)
train['processed_title'] = train['processed_title'].str.replace('[0-9]', '',regex=True)

test['processed_text'] = test['processed_text'].str.replace('[0-9]', '',regex=True)
test['processed_title'] = test['processed_title'].str.replace('[0-9]', '',regex=True)

In [10]:
# remove special characters and punctuation
train['processed_text'] = train['processed_text'].str.replace('[?_!%&/.,®€™():]', '',regex=True)
train['processed_title'] = train['processed_title'].str.replace('[?_!%&/.,®€™():]', '',regex=True)
# remove carriage returns and line feeds; regex=True is explicit because
# recent pandas versions treat the pattern as a literal string by default
train['processed_text'] = train['processed_text'].str.replace(r'[\r\n]', '', regex=True)
train['processed_title'] = train['processed_title'].str.replace(r'[\r\n]', '', regex=True)

test['processed_text'] = test['processed_text'].str.replace('[?_!%&/.,®€™():]', '',regex=True)
test['processed_title'] = test['processed_title'].str.replace('[?_!%&/.,®€™():]', '',regex=True)
test['processed_text'] = test['processed_text'].str.replace(r'[\r\n]', '', regex=True)
test['processed_title'] = test['processed_title'].str.replace(r'[\r\n]', '', regex=True)

In [11]:
# replace hyphens, dashes, quotation marks, apostrophes and '[' with a space
# (regex=True is required for the character class; '[' is escaped inside it)
train['processed_text'] = train['processed_text'].str.replace(r"[-“”\[—‘’']", ' ', regex=True)
train['processed_title'] = train['processed_title'].str.replace(r"[-“”\[—‘’']", ' ', regex=True)

test['processed_text'] = test['processed_text'].str.replace(r"[-“”\[—‘’']", ' ', regex=True)
test['processed_title'] = test['processed_title'].str.replace(r"[-“”\[—‘’']", ' ', regex=True)

In [12]:
# remove accents
train['processed_text'] = train['processed_text'].str.replace('â', "a").str.replace('è', 'e').str.replace('é', 'e').str.replace('í', 'i').str.replace('î', 'i').str.replace('ú', 'u').str.replace('ç', 'c')
train['processed_title'] = train['processed_title'].str.replace('â', "a").str.replace('è', 'e').str.replace('é', 'e').str.replace('í', 'i').str.replace('î', 'i').str.replace('ú', 'u').str.replace('ç', 'c')

test['processed_text'] = test['processed_text'].str.replace('â', "a").str.replace('è', 'e').str.replace('é', 'e').str.replace('í', 'i').str.replace('î', 'i').str.replace('ú', 'u').str.replace('ç', 'c')
test['processed_title'] = test['processed_title'].str.replace('â', "a").str.replace('è', 'e').str.replace('é', 'e').str.replace('í', 'i').str.replace('î', 'i').str.replace('ú', 'u').str.replace('ç', 'c')
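The cell above handles seven specific accented characters. A more general alternative, which this notebook does not use, is Unicode normalization with the standard `unicodedata` module. The sketch below decomposes each character and drops the combining marks. The helper name `strip_accents` is an assumption for illustration. NFKD also rewrites some compatibility characters (ligatures, for instance), so the result is not guaranteed to match the explicit replacements on every input.

```python
import unicodedata

def strip_accents(s):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("façade café naïve"))
```

It could be applied to a column with `train['processed_text'].apply(strip_accents)`.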

##### Removing stop words

In [13]:
# English stop word list
stop_words = set(stopwords.words('english')) 

train['processed_text'] = train['processed_text'].apply(lambda x: [item for item in x.split(' ') if item not in stop_words])
train['processed_title'] = train['processed_title'].apply(lambda x: [item for item in x.split(' ') if item not in stop_words])

test['processed_text'] = test['processed_text'].apply(lambda x: [item for item in x.split(' ') if item not in stop_words])
test['processed_title'] = test['processed_title'].apply(lambda x: [item for item in x.split(' ') if item not in stop_words])
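The filter used above can be tried on a single sentence. The sketch below uses a tiny hand-written stop word set as a stand-in for `stopwords.words('english')` (an assumption made so the example runs without downloading the NLTK corpus; the real list is much longer).

```python
# tiny stand-in for stopwords.words('english'); the real NLTK list is much longer
stop_words = {"the", "a", "of", "to", "is"}

text = "the price of  oil is rising"   # note the double space

# same expression as in the notebook: split on single spaces, drop stop words
tokens = [item for item in text.split(" ") if item not in stop_words]
print(tokens)
```

Because the split is on a single space, consecutive spaces produce empty-string tokens. They survive this filter, which is why the notebook removes empty tokens in a later step.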

##### Lemmatization

In [14]:
lemmatizer = nltk.stem.WordNetLemmatizer()
train['processed_text'] = train['processed_text'].apply(lambda x:  [lemmatizer.lemmatize(w) for w in x])
train['processed_title'] = train['processed_title'].apply(lambda x:  [lemmatizer.lemmatize(w) for w in x])

test['processed_text'] = test['processed_text'].apply(lambda x:  [lemmatizer.lemmatize(w) for w in x])
test['processed_title'] = test['processed_title'].apply(lambda x:  [lemmatizer.lemmatize(w) for w in x])

##### Removing empty tokens (whitespace)

In [15]:
train['processed_text'] = train['processed_text'].apply(lambda x: [a.strip() for a in x if a.strip()])
train['processed_title'] = train['processed_title'].apply(lambda x: [a.strip() for a in x if a.strip()])

test['processed_text'] = test['processed_text'].apply(lambda x: [a.strip() for a in x if a.strip()])
test['processed_title'] = test['processed_title'].apply(lambda x: [a.strip() for a in x if a.strip()])

##### Removing duplicate news articles

In [16]:
# build a text column with the words joined by commas
train["words_text"] = train["processed_text"].apply(lambda x: ", ".join(x))
test["words_text"] = test["processed_text"].apply(lambda x: ", ".join(x))

# drop duplicated articles, keeping the first occurrence
train = train.drop_duplicates(subset=['words_text'], keep='first')
test = test.drop_duplicates(subset=['words_text'], keep='first')
test.shape

(5118, 7)
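The deduplication is done on the joined string rather than on the token list. Python lists are not hashable, so pandas generally cannot compare rows on a column of lists directly. A minimal sketch on invented data (the three rows below are illustrative only):

```python
import pandas as pd

# toy data, invented for illustration only
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "processed_text": [["oil", "price"], ["oil", "price"], ["election"]],
})

# join each token list into one string so rows can be hashed and compared
toy["words_text"] = toy["processed_text"].apply(", ".join)

deduped = toy.drop_duplicates(subset=["words_text"], keep="first")
print(deduped["id"].tolist())
```

Rows 1 and 2 have identical processed text, so only the first of them is kept.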

### Creating new features

The previous step already created one feature: the list of words of each article joined into a single comma-separated string. The same variable is now created for the title field.
From the processed text and title columns, a further feature is created for each: the number of words in each text and each title.
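For a single row, the two kinds of feature look like this. The token list is the processed title of the third training example shown later in this notebook; the variable names are illustrative.

```python
processed_title = ["truth", "might", "get", "fired"]

words_title = ", ".join(processed_title)   # single comma-separated string
title_word_count = len(processed_title)    # number of tokens after cleaning

print(words_title)
print(title_word_count)
```

The count is taken after stop word removal and lemmatization, so it measures content words rather than the raw length of the title.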

In [17]:
train["words_title"] = train["processed_title"].apply(lambda x: ", ".join(x))
test["words_title"] = test["processed_title"].apply(lambda x: ", ".join(x))

train['processed_word_count'] = train['processed_text'].apply(lambda x: len(x))
train['processed_title_word_count'] = train['processed_title'].apply(lambda x: len(x))

test['processed_word_count'] = test['processed_text'].apply(lambda x: len(x))
test['processed_title_word_count'] = test['processed_title'].apply(lambda x: len(x))

In [18]:
# drop texts that became empty after processing and cleaning
train = train[train['processed_word_count']!=0]

### Datasets after processing

In [19]:
print(f"The processed training set has {train.shape[0]} examples and \
{train.shape[1]} columns:") 

The processed training set has 20325 examples and 11 columns:


In [20]:
# show the first five news articles in the training set
train.head()

Unnamed: 0,id,title,author,text,label,processed_text,processed_title,words_text,words_title,processed_word_count,processed_title_word_count
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1,"[house, dem, aide, even, see, comey, letter, j...","[house, dem, aide, even, see, comey, letter, j...","house, dem, aide, even, see, comey, letter, ja...","house, dem, aide, even, see, comey, letter, ja...",434,10
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0,"[ever, get, feeling, life, circle, roundabout,...","[flynn, hillary, clinton, big, woman, campus, ...","ever, get, feeling, life, circle, roundabout, ...","flynn, hillary, clinton, big, woman, campus, b...",367,7
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1,"[truth, might, get, fired, october, tension, i...","[truth, might, get, fired]","truth, might, get, fired, october, tension, in...","truth, might, get, fired",704,4
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1,"[video, civilian, killed, single, u, airstrike...","[civilian, killed, single, u, airstrike, ident...","video, civilian, killed, single, u, airstrike,...","civilian, killed, single, u, airstrike, identi...",307,6
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \r\nAn Iranian woman has been sentenced ...,1,"[print, iranian, woman, sentenced, six, year, ...","[iranian, woman, jailed, fictional, unpublishe...","print, iranian, woman, sentenced, six, year, p...","iranian, woman, jailed, fictional, unpublished...",89,10


In [21]:
print(f"The processed test set has {test.shape[0]} examples and \
{test.shape[1]} columns:") 

The processed test set has 5118 examples and 10 columns:


In [22]:
# show the first five news articles in the test set
test.head()

Unnamed: 0,id,title,author,text,processed_text,processed_title,words_text,words_title,processed_word_count,processed_title_word_count
0,20800,"Specter of Trump Loosens Tongues, if Not Purse...",David Streitfeld,"PALO ALTO, Calif. — After years of scorning...","[palo, alto, calif, year, scorning, political,...","[specter, trump, loosens, tongue, purse, strin...","palo, alto, calif, year, scorning, political, ...","specter, trump, loosens, tongue, purse, string...",765,11
1,20801,Russian warships ready to strike terrorists ne...,,Russian warships ready to strike terrorists ne...,"[russian, warship, ready, strike, terrorist, n...","[russian, warship, ready, strike, terrorist, n...","russian, warship, ready, strike, terrorist, ne...","russian, warship, ready, strike, terrorist, ne...",152,7
2,20802,#NoDAPL: Native American Leaders Vow to Stay A...,Common Dreams,Videos #NoDAPL: Native American Leaders Vow to...,"[video, #nodapl, native, american, leader, vow...","[#nodapl, native, american, leader, vow, stay,...","video, #nodapl, native, american, leader, vow,...","#nodapl, native, american, leader, vow, stay, ...",430,10
3,20803,"Tim Tebow Will Attempt Another Comeback, This ...",Daniel Victor,"If at first you don’t succeed, try a different...","[first, succeed, try, different, sport, tim, t...","[tim, tebow, attempt, another, comeback, time,...","first, succeed, try, different, sport, tim, te...","tim, tebow, attempt, another, comeback, time, ...",355,10
4,20804,Keiser Report: Meme Wars (E995),Truth Broadcast Network,42 mins ago 1 Views 0 Comments 0 Likes 'For th...,"[min, ago, view, comment, like, first, time, h...","[keiser, report, meme, war, e]","min, ago, view, comment, like, first, time, hi...","keiser, report, meme, war, e",51,5


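## TF-IDF and exporting the processed datasets

#### Saving the target values to a new file

In [23]:
y_train = train['label']
# rename the Series before converting it to a frame: pd.DataFrame(y_train, columns=['target'])
# would look for a column named 'target', which the 'label' Series does not provide
y_train.rename('target').to_frame().to_csv("dataset/train_target.csv", index=False)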

#### TF-IDF

Since predictive models do not work directly with text, the word lists of the news articles must be vectorized. The TF-IDF method is used for this:
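To show what the weighting computes, the sketch below works through binary TF-IDF by hand on three invented documents. It assumes the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is how the scikit-learn documentation describes its defaults. It is an illustration of the formula, not a replacement for `TfidfVectorizer`, and the documents and variable names are made up.

```python
import math

# toy documents in the same comma-separated format as words_text
docs = [
    "oil, price, rising",
    "oil, election",
    "election, result, election",
]

# tokenise; set() mirrors binary=True (a term counts once per document)
doc_terms = [set(d.split(", ")) for d in docs]
vocab = sorted(set().union(*doc_terms))
n = len(docs)

# document frequency and smoothed IDF: ln((1 + n) / (1 + df)) + 1
doc_freq = {t: sum(t in terms for terms in doc_terms) for t in vocab}
idf = {t: math.log((1 + n) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(terms):
    """Binary TF times IDF, scaled to unit L2 norm."""
    raw = [idf[t] if t in terms else 0.0 for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(terms) for terms in doc_terms]
print(vocab)
for row in matrix:
    print([round(v, 3) for v in row])
```

Terms that appear in fewer documents receive a larger weight: in the first document, `price` (one document) outweighs `oil` (two documents). Most entries are zero, so on a real vocabulary the matrix is best kept in sparse form.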

In [24]:
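from scipy.sparse import hstack, save_npz

def tf_idf(df, vectorizers=None):
    """Return a sparse TF-IDF matrix (text + title) and the fitted vectorizers."""
    df = df.reset_index(drop=True)
    fit = vectorizers is None
    if fit:
        # one vectorizer per column, so the title no longer overwrites the text
        vectorizers = {col: TfidfVectorizer(binary=True)
                       for col in ('words_text', 'words_title')}
    blocks = [vec.fit_transform(df[col]) if fit else vec.transform(df[col])
              for col, vec in vectorizers.items()]
    # keep the result sparse: the dense .toarray() version raised
    # "ValueError: array is too big" on the training set
    tfidf = hstack(blocks).tocsr()
    print(tfidf.shape)
    return tfidf, vectorizers

In [25]:
tfidf_train, vectorizers = tf_idf(train)
# the sparse matrix is stored in .npz format (the file name is an assumption);
# the processed columns are still exported as CSV
save_npz("../dataset/tfidf_train.npz", tfidf_train)
train.to_csv("../dataset/processed_train.csv", index=False)

In [None]:
# reuse the vectorizers fitted on the training set so both matrices share the same columns
tfidf_test, _ = tf_idf(test, vectorizers)
save_npz("../dataset/tfidf_test.npz", tfidf_test)
test.to_csv("../dataset/processed_test.csv", index=False)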