<a href="https://colab.research.google.com/github/lucsampaioime/Masters-UnB/blob/nmf/NMF_practice_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Implementation in Python with pandas and scikit-learn

In [28]:
import pandas as pd

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.feature_extraction import text

from nltk.corpus import stopwords
from nltk import word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer


import re
import string

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /root/nltk_data...


Importing US presidential inaugural speech data from Kaggle

In [19]:
# Widen the column display to make inspection easier
pd.set_option('display.max_colwidth', 150)

# Read the CSV into a DataFrame
df = pd.read_csv('inaug_speeches.csv', encoding='cp1252', engine='python')

# Preview the DataFrame
df.head()

Unnamed: 0.1,Unnamed: 0,Name,Inaugural Address,Date,text
0,4,George Washington,First Inaugural Address,"Thursday, April 30, 1789",Fellow-Citizens of the Senate and of the House of Representatives: AMONG the vicissitudes incident to life no event could have fille...
1,5,George Washington,Second Inaugural Address,"Monday, March 4, 1793",Fellow Citizens: I AM again called upon by the voice of my country to execute the functions of its Chief Magistrate. When the occas...
2,6,John Adams,Inaugural Address,"Saturday, March 4, 1797","WHEN it was first perceived, in early times, that no middle course for America remained between unlimited submission to a foreign le..."
3,7,Thomas Jefferson,First Inaugural Address,"Wednesday, March 4, 1801","Friends and Fellow-Citizens: CALLED upon to undertake the duties of the first executive office of our country, I avail myself of th..."
4,8,Thomas Jefferson,Second Inaugural Address,"Monday, March 4, 1805","PROCEEDING, fellow-citizens, to that qualification which the Constitution requires before my entrance on the charge again conferred ..."
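
The `encoding='cp1252'` argument passed to `read_csv` tells pandas how to decode the bytes of the file. As a minimal sketch of why that can matter, assume the file contains Windows-1252 punctuation such as curly quotes (an assumption; the source does not say why this encoding was chosen). The byte string below is made up for illustration. The same bytes decode cleanly as cp1252, but yield the U+FFFD replacement character when forced through UTF-8 with `errors='replace'`:

```python
# Hypothetical bytes: a word wrapped in Windows-1252 curly quotes (0x93, 0x94)
raw = b"\x93liberty\x94"

# Decoding with the encoding the bytes were written in recovers the quotes
as_cp1252 = raw.decode("cp1252")

# Decoding as UTF-8 fails on those bytes; errors="replace" substitutes U+FFFD
as_utf8 = raw.decode("utf-8", errors="replace")

print(as_cp1252)
print(as_utf8)
```

Characters of this kind are what the `'�'` substitution in the cleaning function further down is meant to remove.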


Preprocessing the data:

In [20]:
# Keep only the first speech by each president
df = df.drop_duplicates(subset=['Name'], keep='first')
# Reset the index
df = df.reset_index()
# Select the columns with the presidents' names and their speeches
df = df[['Name', 'text']]
# Use the president's name as the index
df = df.set_index('Name')

df.head()

Name,text
George Washington,Fellow-Citizens of the Senate and of the House of Representatives: AMONG the vicissitudes incident to life no event could have fille...
John Adams,"WHEN it was first perceived, in early times, that no middle course for America remained between unlimited submission to a foreign le..."
Thomas Jefferson,"Friends and Fellow-Citizens: CALLED upon to undertake the duties of the first executive office of our country, I avail myself of th..."
James Madison,"UNWILLING to depart from examples of the most revered authority, I avail myself of the occasion now presented to express the profoun..."
James Monroe,I SHOULD be destitute of feeling if I was not deeply affected by the strong proof which my fellow-citizens have given me of their co...
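
The effect of `drop_duplicates(subset=['Name'], keep='first')` is easier to see on a small table. The sketch below uses a hypothetical three-row DataFrame (the speech texts are placeholders, not data from the file) and applies the same steps as the cell above. It relies on the rows being in chronological order, as they appear to be in the preview, so that the first row per name is the first address:

```python
import pandas as pd

# Hypothetical toy data standing in for the speeches file
toy = pd.DataFrame({
    "Name": ["George Washington", "George Washington", "John Adams"],
    "Inaugural Address": [
        "First Inaugural Address",
        "Second Inaugural Address",
        "Inaugural Address",
    ],
    "text": ["first speech", "second speech", "only speech"],
})

# Keep the first row per president, then keep only the name and text
first_only = (
    toy.drop_duplicates(subset=["Name"], keep="first")[["Name", "text"]]
    .set_index("Name")
)

print(first_only)
```

Washington's second address is dropped, leaving one row per president indexed by name.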


Cleaning the data with regular expressions

In [21]:
# Define a function to clean the speech text
def clean_text(text):
    '''Lowercase the text, remove text in square brackets, replace punctuation with spaces, and remove read-error characters and words containing digits.'''
    text = text.lower()
    # Raw strings avoid invalid-escape warnings in the regex patterns
    text = re.sub(r'\[.*?\]', ' ', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)
    text = re.sub(r'\w*\d\w*', ' ', text)
    text = re.sub('�', ' ', text)
    return text
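
To see what each substitution does, the sketch below repeats the same cleaning steps on a made-up sentence (the sample string is an assumption, not text from the dataset). The function is redefined here only so the example runs on its own:

```python
import re
import string

def clean_text(text):
    """Same steps as the cleaning function in the notebook."""
    text = text.lower()
    text = re.sub(r'\[.*?\]', ' ', text)                              # bracketed notes
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)  # punctuation
    text = re.sub(r'\w*\d\w*', ' ', text)                             # words with digits
    text = re.sub('\ufffd', ' ', text)                                # U+FFFD read errors
    return text

# Hypothetical sample sentence
sample = "Fellow-Citizens [applause]: in 1789, we met."
cleaned = clean_text(sample)

# The substitutions leave runs of spaces; split() shows the surviving words
tokens = cleaned.split()
print(tokens)
```

Two details follow from the order of the steps. The bracket pattern has to run before the punctuation pattern, otherwise the brackets themselves would be stripped and their contents kept. And because punctuation is replaced by a space rather than deleted, a hyphenated word such as "fellow-citizens" becomes two separate words.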

In [22]:
clean1 = lambda x: clean_text(x)
# Clean the speeches
df["text"] = df["text"].apply(clean1)

In [23]:
df.head()

Name,text
George Washington,fellow citizens of the senate and of the house of representatives among the vicissitudes incident to life no event could have fille...
John Adams,when it was first perceived in early times that no middle course for america remained between unlimited submission to a foreign le...
Thomas Jefferson,friends and fellow citizens called upon to undertake the duties of the first executive office of our country i avail myself of th...
James Madison,unwilling to depart from examples of the most revered authority i avail myself of the occasion now presented to express the profoun...
James Monroe,i should be destitute of feeling if i was not deeply affected by the strong proof which my fellow citizens have given me of their co...


Lemmatization is the process of grouping the inflected forms of a word so that they can be analyzed as a single item: plural nouns are reduced to the singular and verbs to their base form. In this step only the nouns of each speech are kept and lemmatized.

In [24]:
# Noun extraction and lemmatization function
def nouns(text):
    '''Given a string of text, tokenize the text and pull out only the nouns.'''
    # Predicate that matches the noun tags (NN, NNS, NNP, NNPS)
    is_noun = lambda pos: pos[:2] == 'NN'
    # Split the string into a list of words (tokens)
    tokenized = word_tokenize(text)
    # Lemmatizer used to reduce each noun to its base form
    wordnet_lemmatizer = WordNetLemmatizer()
    # Tag each token, keep the nouns, and lemmatize them
    all_nouns = [wordnet_lemmatizer.lemmatize(word)
                 for (word, pos) in pos_tag(tokenized) if is_noun(pos)]

    # Return the nouns joined into a single space-separated string
    return ' '.join(all_nouns)
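
The noun filter works because the Penn Treebank noun tags (`NN`, `NNS`, `NNP`, `NNPS`) all start with `NN`. The sketch below applies the same prefix test to hand-written `(word, tag)` pairs. These pairs are an assumption standing in for `pos_tag` output, so the example runs without the NLTK models:

```python
# Hand-written (word, tag) pairs standing in for pos_tag output (illustrative only)
tagged = [
    ("citizens", "NNS"),    # plural noun
    ("of", "IN"),           # preposition
    ("the", "DT"),          # determiner
    ("senate", "NN"),       # singular noun
    ("called", "VBD"),      # past-tense verb
    ("washington", "NNP"),  # proper noun
    ("states", "NNPS"),     # plural proper noun
]

def is_noun(pos):
    """True for any tag that begins with 'NN'."""
    return pos[:2] == "NN"

nouns_only = [word for word, pos in tagged if is_noun(pos)]
print(nouns_only)
```

In the notebook's function each surviving word is additionally passed through `WordNetLemmatizer.lemmatize`, which treats its input as a noun by default.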


In [29]:
# Build a DataFrame containing only the nouns from each speech
data_nouns = pd.DataFrame(df.text.apply(nouns))
# Inspect the result
data_nouns.head()

Name,text
George Washington,citizen senate house representative vicissitude incident life event anxiety notification order day month hand i country voice i veneration love re...
John Adams,time course america submission legislature independence claim men reflection danger power fleet army contest dissension form government part count...
Thomas Jefferson,friend citizen duty executive office country i myself presence portion fellow citizen thanks favor consciousness task talent presentiment greatnes...
James Madison,example authority i myself occasion profound impression call country station duty i sanction mark confidence proceeding deliberate suffrage nation...
James Monroe,i destitute i proof fellow citizen confidence office function expression opinion conduct service gratification sensibility estimate importance tru...
