# <font color='#5581A5'> Natural Language Processing

<font color=#5581A5> **The objective of this project is twofold: to delve into the study of natural language processing (NLP) while advancing further in the exploration of Jupyter Notebook and Python.**

Several practices will be employed in this study:

1- Below certain commands, there will be a summary of their meanings.

2- All text will be written in English.

3- The data has been extracted from exercises on the Alura platform.

# About

In this project, two main topics are covered:

Regular Expressions (Regex) for Natural Language Processing (NLP).
How to use Regex in NLP and specifically in Python.
Language Models.
    
What they are, how to build them, and practical applications, such as creating an automatic language detector.
The course starts by utilizing data from Stackoverflow, which contains irrelevant elements like tags and code snippets. Regex is used to clean these texts. After cleaning, the creation of the language model begins. However, a simple model might not meet all requirements and can produce strange results.

- Regex (Regular Expressions):

A sequence of characters that forms a search pattern, mainly used for string matching and manipulation. In NLP, Regex can be used to identify and clean specific patterns in text data.
Natural Language Processing (NLP):

A field of artificial intelligence focused on the interaction between computers and humans through natural language. It involves the ability to read, understand, and generate human language.
Language Models:

Computational models that can understand and generate human language. They are used for various applications, such as language translation, sentiment analysis, and automatic text generation.
Stackoverflow:

A popular question-and-answer website for programmers. It has a vast database of questions and answers, including code snippets, which can be used for data analysis and model training.
Tags and Code Snippets:

- Tags: Labels used to categorize content, often irrelevant in NLP tasks focused on textual content.
- Code Snippets: Small blocks of code embedded within text, which can be noisy data in language processing tasks.

In [1]:
# Imports

import nltk
import pandas as pd
import re

from nltk import bigrams
from nltk.lm.preprocessing import pad_both_ends
from sklearn.model_selection import train_test_split
from nltk.tokenize import WhitespaceTokenizer
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
from nltk.lm import NgramCounter
from nltk.lm import Laplace

data_pt = pd.read_csv('https://raw.githubusercontent.com/alura-cursos/nlp-modelos-linguagem/master/dataset/stackoverflow_portugues.csv')
data_en = pd.read_csv('https://raw.githubusercontent.com/alura-cursos/nlp-modelos-linguagem/master/dataset/stackoverflow_ingles.csv')

In [2]:
# Checking dataset portuguese

data_pt

Unnamed: 0,Id,Título,Questão,Tags,Pontuação,Visualizações
0,2402,Como fazer hash de senhas de forma segura?,"<p>Se eu fizer o <em><a href=""http://pt.wikipe...",<hash><segurança><senhas><criptografia>,350,22367
1,6441,Qual é a diferença entre INNER JOIN e OUTER JOIN?,<p>Qual é a diferença entre <code>INNER JOIN</...,<sql><join>,276,176953
2,579,Por que não devemos usar funções do tipo mysql_*?,<p>Uma dúvida muito comum é por que devemos pa...,<php><mysql>,226,9761
3,2539,As mensagens de erro devem se desculpar?,<p>É comum encontrar uma mensagem de erro que ...,<aplicação-web><gui><console><ux>,214,5075
4,17501,"Qual é a diferença de API, biblioteca e Framew...",<p>Me parecem termos muito próximos e eventual...,<api><framework><terminologia><biblioteca>,193,54191
...,...,...,...,...,...,...
495,194857,O que é Polyfill?,<p>Já vi esse termo <em>Polyfill</em> sendo ut...,<javascript><terminologia><polyfill>,26,6860
496,323137,Pra que serve o comando LOCK TABLES?,<p>Esses dias me deparei com um trecho de um S...,<mysql>,26,657
497,232958,O que é um valor opaco?,<p>Por vezes vejo em documentações ou especifi...,<nomenclatura>,26,587
498,227907,"O que são Proxy, Gateway e Tunnel no protocolo...","<p>Na especificação do protocolo HTTP, mais pr...",<http>,26,625


In [3]:
# Checking dataset english

data_en

Unnamed: 0,Id,Título,Questão,Tags,Pontuação,Visualizações
0,11227809,Why is it faster to process a sorted array tha...,<p>Here is a piece of C++ code that seems very...,<java><c++><performance><optimization><branch-...,23057,1358574
1,927358,How do I undo the most recent local commits in...,<p>I accidentally committed the wrong files to...,<git><version-control><git-commit><undo>,19640,7906137
2,2003505,How do I delete a Git branch locally and remot...,<p>I want to delete a branch both locally and ...,<git><git-branch><git-remote>,15249,6940906
3,292357,What is the difference between 'git pull' and ...,<blockquote>\n <p><strong>Moderator Note:</st...,<git><git-pull><git-fetch>,11008,2543052
4,477816,What is the correct JSON content type?,"<p>I've been messing around with <a href=""http...",<json><http-headers><content-type>,9701,2478940
...,...,...,...,...,...,...
495,6237295,How can I update NodeJS and NPM to the next ve...,<p>I just installed <code>Node.js</code> and <...,<node.js><npm><node-modules><npm-update>,1477,1279392
496,9033,Hidden Features of C#?,<p>This came to my mind after I learned the fo...,<c#><hidden-features>,1476,659870
497,5411538,Redirect from an HTML page,<p>Is it possible to set up a basic HTML page ...,<html><xhtml><meta><html-head>,1476,3091553
498,2763006,Make the current Git branch a master branch,<p>I have a repository in Git. I made a branch...,<git>,1474,653714


In [4]:
# Using regex to find a pattern on column

# Gathering the first "questao" in portuguese

question_pt = data_pt.Questão[0]

# Applying regex

re.findall(r'<p>', question_pt)

['<p>', '<p>', '<p>', '<p>']

In [5]:
# Applying refined regex to find all tags

re.findall(r'<.*?>', question_pt)

['<p>',
 '<em>',
 '<a href="http://pt.wikipedia.org/wiki/Fun%C3%A7%C3%A3o_de_embaralhamento_criptogr%C3%A1fico" rel="noreferrer">',
 '</a>',
 '</em>',
 '</p>',
 '<p>',
 '<a href="http://pt.wikipedia.org/wiki/Ataque_de_for%C3%A7a_bruta" rel="noreferrer">',
 '</a>',
 '<em>',
 '<a href="http://pt.wikipedia.org/wiki/Keylogger" rel="noreferrer">',
 '</a>',
 '</em>',
 '<a href="http://pt.wikipedia.org/wiki/Criptoan%C3%A1lise_de_mangueira_de_borracha" rel="noreferrer">',
 '<em>',
 '</em>',
 '</a>',
 '<em>',
 '</em>',
 '</p>',
 '<p>',
 '</p>',
 '<p>',
 '<em>',
 '</em>',
 '</p>']

In [6]:
# Chaning the text inside the tags

sub_pt = re.sub(r'<.*?>', ' T---E---S---T ', question_pt)
print(sub_pt)

 T---E---S---T Se eu fizer o  T---E---S---T  T---E---S---T hash T---E---S---T  T---E---S---T  de senhas antes de armazená-las em meu banco de dados é suficiente para evitar que elas sejam recuperadas por alguém? T---E---S---T 

 T---E---S---T Estou falando apenas da recuperação diretamente do banco de dados e não qualquer outro tipo de ataque, como  T---E---S---T força bruta T---E---S---T  na página de login da aplicação,  T---E---S---T  T---E---S---T keylogger T---E---S---T  T---E---S---T  no cliente e  T---E---S---T criptoanálise  T---E---S---T rubberhose T---E---S---T  T---E---S---T . Qualquer forma de  T---E---S---T hash T---E---S---T  não vai impedir esses ataques. T---E---S---T 

 T---E---S---T Tenho preocupação em dificultar ou até impossibilitar a obtenção das senhas originais caso o banco de dados seja comprometido. Como dar maior garantia de segurança neste aspecto? T---E---S---T 

 T---E---S---T Quais preocupações adicionais evitariam o acesso às senhas? Existem formas melho

In [7]:
# Creating fuction to remove html tags

def remove(texts, regex):
    if type(texts) == str:
        return regex.sub("", texts)
    else:
        return [regex.sub("", text) for text in texts]

# Testing

regex_html = re.compile(r'<.*?>')
no_tag_question = remove(question_pt, regex_html)
print(no_tag_question)

Se eu fizer o hash de senhas antes de armazená-las em meu banco de dados é suficiente para evitar que elas sejam recuperadas por alguém?

Estou falando apenas da recuperação diretamente do banco de dados e não qualquer outro tipo de ataque, como força bruta na página de login da aplicação, keylogger no cliente e criptoanálise rubberhose. Qualquer forma de hash não vai impedir esses ataques.

Tenho preocupação em dificultar ou até impossibilitar a obtenção das senhas originais caso o banco de dados seja comprometido. Como dar maior garantia de segurança neste aspecto?

Quais preocupações adicionais evitariam o acesso às senhas? Existem formas melhores de fazer esse hash?



In [8]:
# Creating function to remove a code inside the text using the <code> tag

def remove_code(texts, regex):
    if type(texts) == str:
        return regex.sub("CODE", texts)
    else:
        return [regex.sub("CODE", text) for text in texts]

# Testing

question_en = data_en.Questão[0]
regex_code = re.compile(r'<code>(.|(\n))*?</code>')
no_code_question = remove_code(question_en, regex_code)
print(no_code_question)

<p>Here is a piece of C++ code that seems very peculiar. For some strange reason, sorting the data miraculously makes the code almost six times faster.</p>

<pre class="lang-cpp prettyprint-override">CODE</pre>

<ul>
<li>Without CODE, the code runs in 11.54 seconds.</li>
<li>With the sorted data, the code runs in 1.93 seconds.</li>
</ul>

<p>Initially, I thought this might be just a language or compiler anomaly. So I tried it in Java.</p>

<pre class="lang-java prettyprint-override">CODE</pre>

<p>With a somewhat similar but less extreme result.</p>

<hr>

<p>My first thought was that sorting brings the data into the cache, but then I thought how silly that is because the array was just generated.</p>

<ul>
<li>What is going on?</li>
<li>Why is it faster to process a sorted array than an unsorted array?</li>
<li>The code is summing up some independent terms, and the order should not matter.</li>
</ul>



In [9]:
# Now lets apply the subs to the dataset

question_en_no_code = remove_code(data_en.Questão, regex_code)
question_en_no_tags  = remove(question_en_no_code, regex_html)
data_en['No Tags'] = question_en_no_tags

question_pt_no_code = remove_code(data_pt.Questão, regex_code)
question_pt_no_tags  = remove(question_pt_no_code, regex_html)
data_pt['No Tags'] = question_pt_no_tags

In [10]:
# Another treatments

# Lowering all words

def lower_text(texts):
    if type(texts) == str:
        return texts.lower()
    else:
        return [text.lower() for text in texts]

# Regex punctuation

regex_punctuation = re.compile(r'[^\w\s]')

# Regex duplicate spaces

regex_space = re.compile(r' +')

# Regex break line

regex_break_line = re.compile(r'(\n)')

# Regex remove digits

regex_digits = re.compile(r'\d+')

def remove_space(texts, regex):
    if type(texts) == str:
        return regex.sub(" ", texts)
    else:
        return [regex.sub(" ", text) for text in texts]

In [11]:
# Removing punctuation

question_en_no_punc = remove(question_en_no_tags, regex_punctuation)
question_en_min = lower_text(question_en_no_punc)
question_en_no_digits = remove(question_en_min, regex_digits)
question_en_no_break = remove_space(question_en_no_digits, regex_break_line)
question_en_no_space = remove_space(question_en_no_break, regex_space)

question_pt_no_punc = remove(question_pt_no_tags, regex_punctuation)
question_pt_min = lower_text(question_pt_no_punc)
question_pt_no_digits = remove(question_pt_min, regex_digits)
question_pt_no_break = remove_space(question_pt_no_digits, regex_break_line)
question_pt_no_space = remove_space(question_pt_no_break, regex_space)

# Creating new column

data_en['Treated Data'] = question_en_no_space
data_pt['Treated Data'] = question_pt_no_space

# Creating new column

data_en['language'] = 'English'
data_pt['idioma'] = 'Portugues'

In [12]:
# Separating data in train/test english

en_train, en_test = train_test_split(data_en['Treated Data'], test_size=0.2, random_state=123)

# Agregating all words to variable and applying tokenize

all_words_en = ''.join(en_train)
all_words_token_en = WhitespaceTokenizer().tokenize(all_words_en)
print(len(all_words_token_en))

21510


In [13]:
# Separating data in train/test portuguese

pt_train, pt_test = train_test_split(data_pt['Treated Data'], test_size=0.2, random_state=123)

# Agregating all words to variable and applying tokenize

all_words_pt = ''.join(pt_train)
all_words_token_pt = WhitespaceTokenizer().tokenize(all_words_pt)
print(len(all_words_token_pt))

36716


In [14]:
# Applying padding to bigram en

train_bigram_en, vocab_en = padded_everygram_pipeline(2, all_words_token_en)

# Applying padding to bigram pt

train_bigram_pt, vocab_pt = padded_everygram_pipeline(2, all_words_token_pt)

In [15]:
# Creating MLE model english

model_en = MLE(2)
model_en.fit(train_bigram_en, vocab_en)

# Creating MLE model Portuguese

model_pt = MLE(2)
model_pt.fit(train_bigram_pt, vocab_pt)

In [16]:
# Lets try to see what are the common letters that in the first letter of the bigram has 'd'

model_en.counts[['d']].items()

# We can check that the fake char is the most common 'letter' after d

dict_items([('o', 401), ('</s>', 1556), ('e', 1498), ('a', 119), ('d', 63), ('l', 16), ('i', 293), ('r', 37), ('y', 22), ('s', 65), ('n', 12), ('_', 2), ('p', 3), ('u', 49), ('c', 5), ('b', 2), ('k', 7), ('v', 1), ('t', 2), ('q', 2), ('g', 5), ('j', 2), ('f', 1), ('h', 1)])

In [17]:
# Creating a test phrase

phrase = 'good morning'
words = WhitespaceTokenizer().tokenize(phrase)

# Applying fakechar on words

words_fakechar = [list(pad_both_ends(word, n = 2)) for word in words]
words_bigram = [list(bigrams(word)) for word in words_fakechar]
print(words_bigram)

[[('<s>', 'g'), ('g', 'o'), ('o', 'o'), ('o', 'd'), ('d', '</s>')], [('<s>', 'm'), ('m', 'o'), ('o', 'r'), ('r', 'n'), ('n', 'i'), ('i', 'n'), ('n', 'g'), ('g', '</s>')]]


In [18]:
# Calculating perplexity of the word compared to our model

print(model_en.perplexity(words_bigram[0]), model_en.perplexity(words_bigram[1]))

17.26133308212415 12.900755877319751


In [19]:
# Creating function to train model

def model_train_mle(text_list):
    all_questions = ' '.join(text_list)
    all_words = WhitespaceTokenizer().tokenize(all_questions)
    bigrams, vocab = padded_everygram_pipeline(2, all_words)
    model = MLE(2)
    model.fit(bigrams, vocab)
    return model

# Checking to see if the results are the same

model_test_en = model_train_mle(en_train)
print(model_test_en.perplexity(words_bigram[0]), model_test_en.perplexity(words_bigram[1]))

17.26133308212415 12.900755877319751


In [20]:
# Creating function to calculate perplexity

def perplexity_calculate(model, text):

    perplexity = 0
    words = WhitespaceTokenizer().tokenize(text)
    words_fakechar = [list(pad_both_ends(word, n = 2)) for word in words]
    words_bigram = [list(bigrams(word)) for word in words_fakechar]


    for word in words_bigram:
        perplexity += model.perplexity(word)

    return perplexity

# Testing

print(perplexity_calculate(model_en, "good morning"))

30.1620889594439


In [21]:
# Calculating for our first row

print(perplexity_calculate(model_en, en_test.iloc[0]))

195.21299687285554


In [22]:
# Now let see how it goes on pt_test

print(perplexity_calculate(model_en, pt_test.iloc[0]))

# In this situation the probability goes to infinite since theres no word on pt_test related to model_en, so by using perplexity we are dividing 1/0 which tends to infinite

inf


In [23]:
# Lets use laplace model to avoid those probabilities

def model_train_laplace(text_list):
    all_questions = ' '.join(text_list)
    all_words = WhitespaceTokenizer().tokenize(all_questions)
    bigrams, vocab = padded_everygram_pipeline(2, all_words)
    model = Laplace(2)
    model.fit(bigrams, vocab)
    return model

In [24]:
# Creating laplace model english

model_en_laplace = model_train_laplace(en_test)

# Checking if the result tends to infinite

print(perplexity_calculate(model_en_laplace, pt_test.iloc[0]))

5065.0486897292585


In [25]:
# Creating laplace model english

model_pt_laplace = model_train_laplace(pt_train)

# Checking if the result tends to infinite

print(perplexity_calculate(model_pt_laplace, pt_test.iloc[0]))

2009.1937946178912


In [26]:
# Creating function to decide in what language is the text

def language_decision(text_list):
    language = []
    for text in text_list:
        pt = perplexity_calculate(model_pt_laplace, text)
        en = perplexity_calculate(model_en_laplace, text)
        if en >= pt:
            language.append('port')
        else:
            language.append('eng')

    return language

In [27]:
# Checking if the function is working

results_en = language_decision(en_test)
print(len(en_test), results_en.count('eng'))

100 100
