# Natural language processing for Topic Modeling


Using Python and natural language processing for topic modeling: an unsupervised technique for document classification

https://github.com/erickfis


## Topic modeling

Imagine you have a huge collection of documents, each one about a specific subject.
A document could be a movie description, part of a book, a message, a tweet, etc.


If you take the time to read each document, you will learn that they are about science, politics,
medicine, sports, etc., but very often they carry no label specifying the subject.


Now it's your job to group them by subject. Will you read each one of them and label them one by one?
What if your collection contains 1 billion documents?


This is where *Topic modeling* comes in handy: it is a very useful technique for document classification through unsupervised learning. It learns from the collection of documents as a whole and then suggests groups (clusters) of documents based on similarities, such as the frequency or probability of words in each document.


After the documents are split into the suggested groups, we can then look at each group (through samples of them) and choose a proper label for it.


This clustering of documents by topic can be achieved with different techniques:

- tf-idf + clustering
- tf-idf + PCA
- Latent semantic analysis - LSA
- Latent Dirichlet Allocation - LDA

In this notebook we will discuss the LDA method.

## LDA - Latent Dirichlet Allocation

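LDA is a generative probabilistic topic model that can be seen as a Bayesian extension of probabilistic latent semantic analysis (pLSA): the prior distributions of topics per document and of words per topic are assumed to be
of the multivariate beta type, a.k.a. the Dirichlet distribution.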
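
For reference, the Dirichlet distribution is a distribution over probability vectors. Using notation introduced here only for illustration (a topic mixture $\theta$ over $K$ topics and concentration parameters $\alpha_k > 0$), its density is usually written as:

```latex
p(\theta \mid \alpha) =
  \frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)}
  \prod_{k=1}^{K} \theta_k^{\alpha_k - 1},
\qquad \theta_k \ge 0, \quad \sum_{k=1}^{K} \theta_k = 1
```

With $K = 2$ this reduces to the beta distribution, which is why the Dirichlet is described as its multivariate generalization.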

The main advantage of LDA over LSA, PCA or regular clustering is that LDA is capable of detecting intermediary topics between the ones that those methods would detect, since methods based on principal components detect only orthogonal
topics. Thus, LDA tends to reduce overfitting and to increase accuracy.


On the other hand, LDA demands more computation time.


https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation


## Use case: chat bots

Imagine we are a big online retailer and we want to provide a new communication channel for our customers.

We already had a real-time online chat, where human attendants answered our customers. We have stored all of the customers' messages in our database and now we want to create an algorithm that is capable of answering customers just like a human attendant would.

But how can we answer our customers in a natural way, simulating the behavior of a human attendant?

To achieve that goal, we are going to use natural language processing and topic modeling. We will combine different techniques for analyzing the customers' queries and questions in order to compose an appropriate, natural answer.

The techniques are:

- topic modeling, to guide the composition of a proper answer according to the detected subject of the conversation.
For example, if the subject is a complaint, the algorithm should compose an answer taking this information into account.
If the message is a salutation, then salute back.
- keyword detection, such as an order number or some specific product.


## Retrieving stored messages from our database

In order to "train" our algorithm, we must retrieve stored data from our databases.
Note that this is not the typical training process because we don't have any targets or labels ready at this point. This is, in fact, an unsupervised training process.

If the database in question is a NoSQL one, like MongoDB, we could run queries like:

    db.mensagens.find(
        {
        $and: [
                {"user": {$in: ["user_1", "user_2", "attd_1", "sell_1"]}},
                {"message": {$ne: null}}
              ]
        }
    ).pretty()

on MongoDB Compass.


If the messages are stored in an XML format, we could use the Beautiful Soup library to scrape the data:

    from bs4 import BeautifulSoup

    with open('db.xml') as xml_file:
        db = BeautifulSoup(xml_file.read(), "lxml")
    messages = db.find_all('message')

If the messages are stored in a txt file, we could read them using something as simple as:

    messages = []
    with open("db.txt", "r", encoding="utf-8") as incoming:
        for line in incoming:
            if line.startswith('user'):
                messages.append(line)

### Getting the messages from the available database

For this particular notebook, we are going to use a very small set of messages containing some conversations between customers and attendants. The customers are identified by the id "user_1".


In [1]:
conversations = [

    # small talk
    [
        {'user': 'user_1', 'message': 'Hi, how are you?', 'status': ''},
        {'user': 'user_2', 'message': 'fine! and you?   ', 'status': ''},
        {'user': 'user_1', 'message': ' I\'m ok!!!', 'status': ''},
        {'user': 'user_1', 'message': 'got any sale today?', 'status': ''},
        {'user': 'user_2', 'message': 'today we have a 50\"tv" \n ', 'status': ''},
    ],
    
    # customer service
    [
        {'user': 'user_1', 'message': 'Where is my iphone?!'                          , 'status': 'payment_approved'},
        {'user': 'attd_1', 'message': 'Hello, your payment has been approved'         , 'status': 'payment_approved'},
        {'user': 'attd_1', 'message': 'Is the product on delivery route'              , 'status': 'payment_approved'},
        {'user': 'user_1', 'message': 'But it\'s 5 days already!'                     , 'status': 'payment_approved'},
        {'user': 'attd_1', 'message': 'our delivery should take five working days '   , 'status': 'payment_approved'},
        {'user': 'user_1', 'message': 'ow it\'s true'                                 , 'status': 'payment_approved'},
    ],
    
    # sale
    [
        {'user': 'user_1', 'message': 'Where is the iphone 10?'                     , 'status': 'shopping'},
        {'user': 'sell_1', 'message': 'Hello! the iphone X it is out of stock;'     , 'status': 'shopping'},
        {'user': 'user_1', 'message': 'Huum, what do you have available?'           , 'status': 'shopping'},
        {'user': 'sell_1', 'message': 'We have the iphone X plus and the samsung s8', 'status': 'shopping'},
        {'user': 'user_1', 'message': 'Is the samsung better than the iphone?'      , 'status': 'shopping'},
        {'user': 'sell_1', 'message': 'They are different, but they are the best'   , 'status': 'shopping'},
    ],
]

In [2]:
# let's retrieve only the customers' messages
user_messages = []

for chat in conversations:
    user_messages.append([message["message"] for message in chat if message["user"]=="user_1"])

user_messages = sum(user_messages, [])
user_messages

['Hi, how are you?',
 " I'm ok!!!",
 'got any sale today?',
 'Where is my iphone?!',
 "But it's 5 days already!",
 "ow it's true",
 'Where is the iphone 10?',
 'Huum, what do you have available?',
 'Is the samsung better than the iphone?']

In [3]:
# let's add some common customer messages to the list

add_msg = ["Good morning!",
              "Good night!",
              "Good evening",
              "I'd like to make a complaint",
              "I'd like to make a exchange",
              "I'd like to make a refund",
              "What's the best mobile?",
              "my smart phone is broken!",
              "I can't finish my purchase",
              "the site is not working!",
              "When it is going to arrive?",
              "What's is the delivery time?",
              "this device is really bad",
              "I need some help choosing a mobile",
              "do you accept credit cards?",
              "the tv arrived already broken",
              "the device arrived already broken",
              "the device doesn't work"
             ]

In [4]:
all_msg = user_messages+add_msg
all_msg

['Hi, how are you?',
 " I'm ok!!!",
 'got any sale today?',
 'Where is my iphone?!',
 "But it's 5 days already!",
 "ow it's true",
 'Where is the iphone 10?',
 'Huum, what do you have available?',
 'Is the samsung better than the iphone?',
 'Good morning!',
 'Good night!',
 'Good evening',
 "I'd like to make a complaint",
 "I'd like to make a exchange",
 "I'd like to make a refund",
 "What's the best mobile?",
 'my smart phone is broken!',
 "I can't finish my purchase",
 'the site is not working!',
 'When it is going to arrive?',
 "What's is the delivery time?",
 'this device is really bad',
 'I need some help choosing a mobile',
 'do you accept credit cards?',
 'the tv arrived already broken',
 'the device arrived already broken',
 "the device doesn't work"]

# Pre-processing

In order to get the messages classified by topics, we must perform some pre-processing on them. 

LDA topic modeling sees each document as a bag of words (BOW), 
so we need to start by transforming each message that way.
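
To make the bag-of-words idea concrete, here is a minimal sketch using only the standard library. The two token lists are made-up examples, and `Counter` is just a convenient stand-in for the real representation built later in the notebook.

```python
from collections import Counter

# Two toy messages, already split into lowercase tokens (made-up examples).
doc_a = ["the", "samsung", "better", "than", "the", "iphone"]
doc_b = ["the", "iphone", "better", "than", "the", "samsung"]

# A bag of words keeps only how often each token occurs, not where it occurs.
bow_a = Counter(doc_a)
bow_b = Counter(doc_b)

print(bow_a)
print(bow_a == bow_b)  # word order is discarded, so both bags compare equal
```

Because word order is thrown away, two messages with opposite meanings can end up with the same bag; this is a known trade-off of the representation.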

As the first step towards our BOW, we must build a token generator that will:

- lowercase each word
- remove numbers (we are assuming that they won't help here)
- remove short words (2 characters or fewer)
- spell check each word
- stem or lemmatize the words (i.e. keep only their "root")

These restrictions are applied in order to avoid unnecessary complexity: there are no evident gains otherwise.

So, we want the following words to be seen as the same by our model:
- Mobile|mobile|mobiles
- boy|boys
- girl|girls
- samsung|sansumg|Sannsungui

## Spell Check

The chosen spell checker is the *enchant project*: https://github.com/AbiWord/enchant

It depends on the local dictionary, so we must install myspell:

    # sudo apt-get install myspell-en-us

Besides, let's add some words to the dictionary, such as manufacturers and products:

In [5]:
terms = ["samsung", "motorola", "apple", "iphone", "pixel", "google", "I'd" ]

file = "terms.txt"
with open(file, "w") as text_file:
    for term in terms:
        print(term, file=text_file)


In [6]:
import enchant
d = enchant.DictWithPWL("en_US",file)
d.check("Samsung") # if True, it means that the spell checker knows the word

True

### Getting better suggestions from enchant

The suggest() method of the enchant dictionary provides a list of candidates for fixing the spelling. 
Usually the first option is the best, but it doesn't always work as expected:

In [7]:
d.suggest("Samjung")

['Jung', 'Smugging', 'Samarkand', 'Samsung']

Instead, we should choose a proper fix through a similarity comparison.

To do so, we will use:

    difflib.SequenceMatcher()
    difflib.SequenceMatcher().ratio()

The SequenceMatcher class compares pairs of sequences in a human-friendly way. 
The ratio() method measures the similarity of the pair as a value between 0 and 1. As a rule of thumb, values above 0.6 indicate a close match.

In [8]:
import difflib
difflib.SequenceMatcher(None, "samsung", "sony").ratio()

0.36363636363636365

In [9]:
difflib.SequenceMatcher(None, "samsung", "sansumg").ratio()

0.7142857142857143
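
The Python documentation defines `ratio()` as `2.0 * M / T`, where `T` is the total number of elements in both sequences and `M` is the number of matching elements. The sketch below recomputes it from the matching blocks; `ratio_by_hand` is a hypothetical helper written for this illustration, not part of `difflib`.

```python
import difflib

def ratio_by_hand(a, b):
    """Recompute SequenceMatcher.ratio() as 2 * M / T."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # M: total size of the matching blocks found by the matcher
    matches = sum(block.size for block in matcher.get_matching_blocks())
    # T: total number of characters in both strings
    total = len(a) + len(b)
    return 2.0 * matches / total if total else 1.0

print(ratio_by_hand("samsung", "sansumg"))
print(difflib.SequenceMatcher(None, "samsung", "sansumg").ratio())
```

Both lines should print the same value, which makes it easier to reason about why a given misspelling lands above or below the 0.6 rule of thumb.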

In [10]:
def spell_checker(word):
    
    best_fix = word # fall back to the original word if there are no suggestions
    best_ratio = 0 # start with similarity 0

    sugestions = set(d.suggest(word))
    for sugestion in sugestions:
        tmp = difflib.SequenceMatcher(None, word, sugestion).ratio()
        if tmp > best_ratio:
            best_fix = sugestion
            best_ratio = tmp # raise the bar for the next comparisons

    return best_fix


In [11]:
spell_checker("samsungui")

'samsung'

In [12]:
spell_checker("samjung")

'samsung'
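
When enchant or its system dictionaries are not available, a rough substitute for domain terms can be built with `difflib.get_close_matches` from the standard library. This is a different, simpler technique than the enchant-based checker above: it only corrects words towards a fixed vocabulary. `VOCABULARY` and `closest_term` are names assumed for this sketch.

```python
import difflib

# Hypothetical vocabulary: the brand and product terms used in this notebook.
VOCABULARY = ["samsung", "motorola", "apple", "iphone", "pixel", "google"]

def closest_term(word, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the most similar vocabulary term, or the word itself if none is close."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(closest_term("samjung"))
print(closest_term("hello"))
```

The `cutoff` plays the same role as the 0.6 similarity threshold discussed earlier: words that are not close to any known term are returned unchanged.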

## The token generator


The token generator will transform the messages into BOW.

It must:

- lowercase each word
- split the message into words, removing numbers
- remove short words (2 characters or fewer)
- apply the spell check
- stem or lemmatize the words


Stemming is faster, while lemmatizing is more precise but takes more computing time.

From Wikipedia:

    Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

http://en.wikipedia.org/wiki/Lemmatisation


The choice between the former and the latter really depends on the application. Sometimes just stemming will be enough.
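
The difference can be illustrated with a deliberately tiny toy, assuming a three-suffix rule set and a three-entry lookup table. Real stemmers and lemmatizers are far more elaborate; `toy_stem`, `toy_lemmatize`, `SUFFIXES` and `LEMMAS` are all hypothetical names made up for this sketch.

```python
# Toy illustration only: not a replacement for a real stemmer or lemmatizer.

SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny rule set

# Hypothetical lookup table standing in for a real lemma dictionary.
LEMMAS = {"arrived": "arrive", "days": "day", "better": "good"}

def toy_stem(word):
    """Chop the first matching suffix, with no knowledge of the word itself."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a dictionary of known forms."""
    return LEMMAS.get(word, word)

for word in ["arrived", "days", "better"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The rule-based function is cheap but can produce non-words and cannot relate irregular forms, while the lookup-based one is only as good as its dictionary.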


In [13]:
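import re # for regular expressions

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# the WordNet and stop word data must be available locally,
# e.g. via nltk.download('wordnet') and nltk.download('stopwords')
lemma = WordNetLemmatizer()
stopWords = set(stopwords.words('english'))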

In [14]:
def tokenizator(message, size=2, fix=1, lemmatize=1):

    message = message.lower() # lowercase
    
    # get tokens, including accented characters, excluding punctuation and numbers
    tokens = re.findall("[-'a-zA-ZÀ-ÖØ-öø-ÿ]+", message) 
    
    # keep only words longer than `size` characters
    tokens = [token for token in tokens if len(token) > size] 
    
    # spell check and correction only if needed
    if fix:
        tokens = [spell_checker(token) if not d.check(token) else token for token in tokens]
    
    # lemmatize words
    if lemmatize:
        tokens = [lemma.lemmatize(t) for t in tokens] 
    
    # let's keep stop words for now because the documents are already too small
    #tokens = [t for t in tokens if t not in stopWords] # remove stopwords
    
    return tokens

### Applying the token generator

Let's now transform the messages from our database into tokens, as this is required to later obtain our bag of words.

In [15]:
msg_pro = [tokenizator(msg) for msg in all_msg]

In [16]:
import pandas as pd
compare = pd.DataFrame({"origin": all_msg, "tokenized": msg_pro})
compare

| | origin | tokenized |
|---|---|---|
| 0 | Hi, how are you? | [how, are, you] |
| 1 | I'm ok!!! | [ism] |
| 2 | got any sale today? | [got, any, sale, today] |
| 3 | Where is my iphone?! | [where, iphone] |
| 4 | But it's 5 days already! | [but, it's, day, already] |
| 5 | ow it's true | [it's, true] |
| 6 | Where is the iphone 10? | [where, the, iphone] |
| 7 | Huum, what do you have available? | [hum, what, you, have, available] |
| 8 | Is the samsung better than the iphone? | [the, samsung, better, than, the, iphone] |
| 9 | Good morning! | [good, morning] |


## Applying the LDA


The *gensim* package provides the tools needed to implement an LDA analysis in Python:

https://radimrehurek.com/gensim/models/ldamodel.html



In [17]:
from gensim import corpora, models

dictionary = corpora.Dictionary(msg_pro) # getting a dictionary from our collection



In [18]:
body = [dictionary.doc2bow(msg) for msg in msg_pro] # term matrix from our collection
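
As a rough picture of what the dictionary and `doc2bow` steps produce, the sketch below builds the same kind of structure with the standard library: an integer id per distinct token, then a list of `(token_id, count)` pairs per message. `build_vocabulary` and `to_bow` are hypothetical helpers, and the ids gensim assigns may differ from the ones shown here.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocabulary = {}
    for doc in tokenized_docs:
        for token in doc:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

def to_bow(tokens, vocabulary):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    return sorted(counts.items())

docs = [["where", "the", "iphone"],
        ["the", "samsung", "better", "than", "the", "iphone"]]
vocab = build_vocabulary(docs)
print(vocab)
print([to_bow(doc, vocab) for doc in docs])
```

Tokens that never appeared in the training collection are simply dropped, which is why a completely unseen message ends up with an empty or nearly empty representation.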

## Training our LDA model

We must choose a starting number of topics according to our knowledge about the collection, just as we would when performing k-means clustering, where we have to choose a starting number of clusters. Later we will analyze the proper metrics and decide whether to increase or decrease the number of topics.

Setting the parameters for the lda training:

- let's start by assuming our collection contains 6 different topics.
- let's run 100 passes over the collection, so that the topic separation has a chance to converge.
- the alpha parameter is about the document-topic density. A higher value indicates that each document contains more topics. We expect the messages from our customers to be about one topic only, so we should use a small number here.
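
The effect of alpha can be seen by drawing topic mixtures from a symmetric Dirichlet prior. The sketch below uses the standard construction (independent gamma draws, normalized to sum to 1); `sample_dirichlet` is a hypothetical helper and the alpha values are chosen only for illustration, not taken from any tuning of this model.

```python
import random

def sample_dirichlet(alpha, num_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_topics)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(95276)  # seeded only so that reruns give the same draws
for alpha in (0.1, 0.4, 5.0):
    mixture = sample_dirichlet(alpha, 6, rng)
    print(alpha, [round(p, 2) for p in mixture])
```

Small alpha values tend to put most of the probability mass on one or two topics, while large values spread it more evenly, which matches the expectation that a short customer message is about a single subject.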


Let's also record the time it takes to finish the process.


In [49]:
import time
import random

start = time.time()
random.seed(95276)
model = models.ldamodel.LdaModel(
    body, num_topics=6, id2word=dictionary, passes=100,
    alpha=.4, eta=5, random_state=95276,
    gamma_threshold=.01, minimum_phi_value=.005,
    minimum_probability=.01, per_word_topics=True,
)

# alpha (document-topic relationship) and eta (topic-word relationship)
# could also be learned from the data with the "auto" setting
    
print("\n --- %s seconds ---" % round((time.time() - start),4))


 --- 2.6708 seconds ---


In [53]:
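# note: get_topic() is defined in the "Showing the identified topics" section below;
# run that cell first (the cell numbers show this one was executed out of order)
df = pd.DataFrame({"messages": all_msg})
df["tokens"] = df.messages.apply(lambda x: tokenizator(x))
df["topic"] = df.messages.apply(lambda x: get_topic(x))
df.sort_values(by="topic")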

| | messages | tokens | topic |
|---|---|---|---|
| 4 | But it's 5 days already! | [but, it's, day, already] | 0 |
| 22 | I need some help choosing a mobile | [need, some, help, choosing, mobile] | 0 |
| 19 | When it is going to arrive? | [when, going, arrive] | 0 |
| 26 | the device doesn't work | [the, device, doesn't, work] | 2 |
| 24 | the tv arrived already broken | [the, arrived, already, broken] | 2 |
| 3 | Where is my iphone?! | [where, iphone] | 2 |
| 6 | Where is the iphone 10? | [where, the, iphone] | 2 |
| 21 | this device is really bad | [this, device, really, bad] | 2 |
| 8 | Is the samsung better than the iphone? | [the, samsung, better, than, the, iphone] | 2 |
| 20 | What's is the delivery time? | [what's, the, delivery, time] | 2 |


### Showing the 3 main terms for each topic

In [20]:
model.print_topics(num_topics=6, num_words=3)

[(0, '0.019*"you" + 0.017*"accept" + 0.017*"credit"'),
 (1, '0.017*"good" + 0.016*"evening" + 0.016*"morning"'),
 (2, '0.028*"the" + 0.019*"iphone" + 0.018*"broken"'),
 (3, '0.017*"good" + 0.016*"any" + 0.016*"got"'),
 (4, '0.017*"good" + 0.016*"it\'s" + 0.016*"purchase"'),
 (5, '0.020*"it\'d" + 0.020*"make" + 0.020*"like"')]

### Serializing the model

So that we don't have to train the model again, let's save it to disk.

In [21]:
import pickle

# write to disk (the with blocks make sure the files are closed)
with open("lda.model", 'wb') as f:
    pickle.dump(model, f)
with open("dictionary.model", 'wb') as f:
    pickle.dump(dictionary, f)

# load back
with open("lda.model", 'rb') as f:
    model = pickle.load(f)
with open("dictionary.model", 'rb') as f:
    dictionary = pickle.load(f)

model.print_topics(num_topics=6, num_words=3)


[(0, '0.019*"you" + 0.017*"accept" + 0.017*"credit"'),
 (1, '0.017*"good" + 0.016*"evening" + 0.016*"morning"'),
 (2, '0.028*"the" + 0.019*"iphone" + 0.018*"broken"'),
 (3, '0.017*"good" + 0.016*"any" + 0.016*"got"'),
 (4, '0.017*"good" + 0.016*"it\'s" + 0.016*"purchase"'),
 (5, '0.020*"it\'d" + 0.020*"make" + 0.020*"like"')]

## Visual analysis of the topics

The bigger the distinction between groups, the better.
We can improve this distinction by adjusting the model parameters when training it.


In [None]:
# note: in recent pyLDAvis releases this module is named pyLDAvis.gensim_models
import pyLDAvis.gensim
vis = pyLDAvis.gensim.prepare(model, body, dictionary)
pyLDAvis.enable_notebook()

vis

## Showing the identified topics

In [52]:
def get_topic(message):
    tokens = tokenizator(message)
    bow = dictionary.doc2bow(tokens)
    # model[bow][0] holds the candidate topics, each with a score that tells
    # the probability of the message belonging to that topic
    # (index 0 is needed because the model was trained with per_word_topics=True).
    #
    # we sort it to get the topic with the highest probability.
    # Also, if that probability is lower than 20%, we choose to say that we just
    # don't know what the message is about (category 6)
    guess = sorted(model[bow][0], key=lambda y: y[1], reverse=True)[0] # most probable topic and its score
    return 6 if guess[1] < .2 else guess[0]
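
The selection step inside get_topic() can be looked at in isolation. The sketch below assumes the topic distribution is a list of `(topic_id, probability)` pairs; the probabilities are made up, and `pick_topic` is a hypothetical helper that also guards against an empty list, which the original function does not.

```python
def pick_topic(topic_probs, threshold=0.2, unknown=6):
    """Return the most probable topic id, or `unknown` if its probability is below threshold."""
    if not topic_probs:
        return unknown
    topic, prob = max(topic_probs, key=lambda pair: pair[1])
    return unknown if prob < threshold else topic

print(pick_topic([(0, 0.05), (2, 0.80), (5, 0.15)]))
print(pick_topic([(0, 0.18), (1, 0.17), (2, 0.16)]))
```

Using `max` avoids sorting the whole list when only the top entry is needed, and keeping the threshold as a parameter makes it easy to experiment with stricter or looser cut-offs.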

In [25]:
df = pd.DataFrame({"messages": all_msg})
df["tokens"] = df.messages.apply(lambda x: tokenizator(x))
df["topic"] = df.messages.apply(lambda x: get_topic(x))
df.sort_values(by="topic")

| | messages | tokens | topic |
|---|---|---|---|
| 4 | But it's 5 days already! | [but, it's, day, already] | 0 |
| 5 | ow it's true | [it's, true] | 0 |
| 11 | Good evening | [good, evening] | 0 |
| 10 | Good night! | [good, night] | 0 |
| 9 | Good morning! | [good, morning] | 0 |
| 21 | this device is really bad | [this, device, really, bad] | 1 |
| 26 | the device doesn't work | [the, device, doesn't, work] | 2 |
| 18 | the site is not working! | [the, site, not, working] | 2 |
| 25 | the device arrived already broken | [the, device, arrived, already, broken] | 2 |
| 20 | What's is the delivery time? | [what's, the, delivery, time] | 2 |


## Labeling the topics


After applying the model, we can look at a small sample of each group so we can add a friendly label to it.

In [None]:
#showing a sample of each group, to aid labeling each one
df.groupby('topic').apply(lambda x: x.sample(frac=.8))

In [None]:
# a dictionary of topic labels
labels = ["delivery", "demand", "indication", "salutation", "comparison", "problem", "unknown"]
labels = dict(zip(range(7), labels))
labels

In [None]:
df["rotulo"] = df.topic.replace(labels)
df

## Applying the model to new messages

In [None]:
message = "whn it's goin to arive?"
labels[get_topic(message)]

In [None]:
message = "ah!"
labels[get_topic(message)]

# Keyword detection

The keywords, together with the topic prediction, will help to compose a helpful answer.

## Order Number


In [None]:
def get_order(msg):
    tokens = re.findall("[0-9]+", msg)
    order = [token for token in tokens if len(token)==10] # order number has size 10
    
    return order
    
get_order("just bought a iphone, the order is 1234567890")
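
The same rule (a run of exactly 10 digits) can also be expressed in a single regular expression with lookarounds. This is only an alternative sketch; `ORDER_PATTERN` and `find_order_numbers` are names assumed here, and the 11-digit phone number is a made-up example.

```python
import re

# Exactly 10 digits, not preceded or followed by another digit.
ORDER_PATTERN = re.compile(r"(?<![0-9])[0-9]{10}(?![0-9])")

def find_order_numbers(msg):
    """Return every standalone 10-digit number found in the message."""
    return ORDER_PATTERN.findall(msg)

print(find_order_numbers("just bought a iphone, the order is 1234567890"))
print(find_order_numbers("my phone number is 12345678901"))
```

The lookarounds keep longer digit runs from being cut into a false 10-digit match, which is the same protection the length filter in get_order() provides.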

## Device type

In [None]:
def get_device(msg):
    tokens = tokenizator(msg, size=1, lemmatize=0, fix=0)
    
    # a list of items the customer may be interested in - feed it with the products sold
    # we could use NLP POS (part-of-speech) tagging to recognize the item, but we ran out of time here =D
    
    possibilities = ["tv", "mobile", "television", "microwave", "site", "freezer", "tire", "pants"] 
    
    device = [token for token in tokens if token in possibilities]
    return device
    

get_device("I need info on a new mobile and freezer")
#get_device("I need a recommendation for a tv")

# Composing the answer

The final answer to the user message will be composed according to:

- the predicted topic - each topic will have an auxiliary function
- detected keywords


We will make use of the following auxiliary functions:

- get_topic() - ok
- get_device() - ok
- get_order() - ok
- order_status() # queries an order database to retrieve info on the transaction
- salute_back() 
- unknown_msg()


In [None]:
import datetime

def salute_back(tipo="salute"):
    agora = datetime.datetime.now()
    hora = agora.hour
    
    if tipo == "salute":
        # non-overlapping ranges, so that noon is not reported as morning
        if 6 <= hora < 12:
            saudacao = "Good morning"
        elif 12 <= hora < 19:
            saudacao = "Good afternoon"
        else:
            saudacao = "Good night"
        
        mensagem = saudacao + ", dear customer!"
            
    else:
        mensagem = "It's {} hours".format(hora)
        
    return mensagem
        
salute_back()

In [None]:
salute_back("hora")

In [None]:
def order_status(number):
    return " Status of order {}, according to the db: status".format(number)

order_status(128312983)

In [None]:
unknown_counter = 0

In [None]:
def unknown_msg():
    global unknown_counter
    messages = ["It's all ok here on Earth. How can I help you?",
                "Please be more specific...",
                "I'm just a tired robot. Please explain, slowly...",
                "is this a new kind of joke?"
                ]
    
    if not unknown_counter:
        message = "{} {} and {}".format(salute_back(), salute_back("hora"), messages[0])
    else:
        message = messages[unknown_counter]
    
    unknown_counter += 1
    if unknown_counter == 4:
        unknown_counter = 1
    
    return message

unknown_msg()


Let's now create a dictionary containing answers for each topic.
The first answer in each class assumes that there are no keywords in the message.
The second answer in each class assumes that keywords are present.

In [None]:
m_delivery = ["Our delivery time is 5 working days. Please tell me the order number so I can fetch more information on it",
              "Our delivery time is 5 working days. Let's check the order status {}."]

m_request = ["Sure thing! What's the order number?",
             "Ok. Let's check the order status {}."]

m_indication = ["What type of device would you like recommendations for?",
                "Here are the best deals for {}."]

m_comparisson = ["We can help you choose the best options. What kind of device are you looking for?",
                 "These are the best options for {}."]

m_problem = ["Easy! Everything can be fixed. What's going on?",
             "Easy! Let's fix this issue with {} the best we can."]

all_answers = {
    0: m_delivery,
    1: m_request,
    2: m_indication,
    3: salute_back(),
    4: m_comparisson,
    5: m_problem,
    6: unknown_msg()
}

all_answers[3]
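
One caveat with a dictionary like the one above: writing `salute_back()` or `unknown_msg()` with parentheses stores the result of a single call, so the greeting and the fallback message are frozen at the moment the dictionary is built. A possible alternative, sketched here with hypothetical stand-in functions (`greet`, `fallback`, `dynamic_answers`, `reply`), is to store the functions themselves and call them per message.

```python
import itertools

# Hypothetical stand-ins for the notebook's helper functions.
def greet():
    return "Good morning, dear customer!"

_fallbacks = itertools.cycle(["Please be more specific...", "Is this a new kind of joke?"])

def fallback():
    return next(_fallbacks)

# Store the functions themselves (no parentheses), not their results.
dynamic_answers = {3: greet, 6: fallback}

def reply(topic):
    """Call the stored function so the answer is computed for each message."""
    return dynamic_answers[topic]()

print(reply(6))
print(reply(6))  # a different fallback, because fallback() ran again
```

With this layout the time-dependent greeting and the rotating fallback messages stay up to date without rebuilding the dictionary.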

## Answering the customer

Now that all the auxiliary functions are ready, we can compose our answer:

In [None]:
def sub_resp(var, topic): # ok, just one more auxiliary function
    if var:
        if var[0].isdigit():
            answer = all_answers[topic][1].format(var[0]) + order_status(var[0])
        else:
            answer = all_answers[topic][1].format(var[0])
    else:
        answer = all_answers[topic][0]
            
    return answer

def answer_costumer(message): # now for real

    topic = get_topic(message)
    n_order = get_order(message)
    device = get_device(message)
    
    if topic in [0,1]:
        return sub_resp(n_order, topic)
    
    if topic in [2,4,5]:
        return sub_resp(device, topic)
    
    if topic == 3:
        return salute_back() # computed now, so the greeting matches the current time
    
    return unknown_msg() # cycles through the fallback messages
    
unknown_counter = 0

In [None]:
message = "got any sale on mobiles today?"

In [None]:
answer_costumer(message)

In [None]:
answer_costumer("good morning!")

In [None]:
answer_costumer("whacka whacka whacka!")

In [None]:
answer_costumer("the site isn't working")

In [None]:
answer_costumer("whats the eta for order 1234567890?")

# Scalability 

In order to optimize the processing time, the *gensim* package offers alternative ways to train an LDA model.

Among them:

- set up a cluster and compute the job in a distributed fashion.
- instead of running 100 or n passes over the whole collection (the batch mode), we can use the "online" mode, where only a subset of m messages is taken into account to train the model at a time. Once the model has been updated with that subset, it takes the next subset, processes it and updates the model again, and so on until the whole collection has been processed.
- a mix of both modes
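
The idea of feeding the collection in subsets can be sketched with a small generator. gensim handles the chunking internally through its `chunksize` parameter, so this is only an illustration of the concept; `chunked` is a hypothetical helper and the "documents" are placeholders.

```python
from itertools import islice

def chunked(corpus, size):
    """Yield successive lists of at most `size` documents from any iterable corpus."""
    iterator = iter(corpus)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

# Toy corpus of 5 "documents"; in the online mode each chunk would be used
# to update the model before the next chunk is read.
for chunk in chunked(["doc1", "doc2", "doc3", "doc4", "doc5"], size=2):
    print(chunk)
```

Because the generator never needs the whole collection in memory, the same pattern works when the messages are streamed from a file or a database cursor.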


Next, we will compare both modes in terms of computation time.

To do so, let's simulate a collection 1000 times larger than ours and train on it first in batch mode (10 passes) and then in online mode.


In [None]:
import time
body2 = body*1000

In [None]:
start = time.time()

model_bath = models.ldamodel.LdaModel(body2, num_topics=6, id2word = dictionary, passes=10)
    
print("\n --- %s seconds ---" % round((time.time() - start),4))

It took 51 seconds to train the model in batch mode on this machine (Core i5, 4 GB of RAM).

In [None]:
start = time.time()

model_online = models.ldamodel.LdaModel(body2, num_topics=6, id2word = dictionary, update_every=1, chunksize=100, passes=1)
    
print("\n --- %s seconds ---" % round((time.time() - start),4))

It took 6 seconds to train the model in online mode on this machine (Core i5, 4 GB of RAM).

In [None]:
model_bath.print_topics(num_words=3)

In [None]:
model_online.print_topics(num_words=3)

## About me


https://github.com/erickfis

My skills:

    - R & Python (Pandas, SciPy, scikit-learn, dplyr, caret)
    - BI Tableau, Google Vis, ggPlot, Shiny Dashboards
    - MySQL / Teradata, NoSQL, MongoDB, Apache Cassandra
    - Hadoop, Amazon Web Services – AWS EC2
    - Machine Learning & Regression models, Decision Trees, etc
    - Natural Language Processing – NLP
    - Clustering / Association rules
    - Inferential statistics, A/B testing
    - Exploratory data analysis
    - Git / Github
    - Rmarkdown Reproducible Research / Jupyter Notebooks
    - Physicist
    

*Erick Gomes Anastácio*
