# Working with text for machine learning

## 20 newsgroups

### We will try to predict which newsgroup a message was sent to, based on the text of the message.

### The dataset is called “Twenty Newsgroups”. Here is the official description, quoted from the website:

    The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
    
### In this tutorial you will have a chance to fill in the missing steps of the procedure.

In [23]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

## First, let's solve a really trivial text classification problem

Let's classify texts by whether they express love for cats or the opposite:

In [87]:
cats_vs_dogs = pd.DataFrame([
    {"text":"I love cats, cats are cute", "love": 1},
    {"text":"Cats! so sweeeet", "love": 1},
    {"text":"Cats? Hate them, prefer dogs", "love": 0},
    {"text":"Dumb ass cat", "love": 0}
])
cats_vs_dogs

Unnamed: 0,love,text
0,1,"I love cats, cats are cute"
1,1,Cats! so sweeeet
2,0,"Cats? Hate them, prefer dogs"
3,0,Dumb ass cat


So we will try to predict 'love' from 'text'.

## At a low level, machine learning models work on numbers. We must extract numerical features from text in such a way that similar documents get similar representations

### A very basic way is to rely on word occurrence, i.e. each document is represented as a bag of words.

In [90]:
count_vect = CountVectorizer()
count_vect.fit(cats_vs_dogs['text'])
X_train_counts = count_vect.transform(cats_vs_dogs['text'])

#### Vectorization result: This is what the model will see

In [91]:
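# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
pd.DataFrame(X_train_counts.toarray(), columns=count_vect.get_feature_names_out())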

Unnamed: 0,are,ass,cat,cats,cute,dogs,dumb,hate,love,prefer,so,sweeeet,them
0,1,0,0,2,1,0,0,0,1,0,0,0,0
1,0,0,0,1,0,0,0,0,0,0,1,1,0
2,0,0,0,1,0,1,0,1,0,1,0,0,1
3,0,1,1,0,0,0,1,0,0,0,0,0,0


#### This is our dictionary - i.e. all the words the model handles

In [92]:
list(count_vect.get_feature_names_out())

['are',
 'ass',
 'cat',
 'cats',
 'cute',
 'dogs',
 'dumb',
 'hate',
 'love',
 'prefer',
 'so',
 'sweeeet',
 'them']

### CountVectorizer learns the dictionary from the data
### Every document is represented as a vector that counts each dictionary word
### Afterwards, it ignores unknown/new words
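
The three points above can be checked directly on the toy corpus. The following is a minimal sketch; the sentence in `new_text` and the variable names are made up for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love cats, cats are cute",
    "Cats! so sweeeet",
    "Cats? Hate them, prefer dogs",
    "Dumb ass cat",
]

vectorizer = CountVectorizer()
vectorizer.fit(corpus)  # the dictionary is learned here, from the data

# Hypothetical new document: 'pizza' never appeared during fit.
new_text = ["dogs love pizza"]
new_counts = vectorizer.transform(new_text).toarray()[0]

# Map the non-zero counts back to tokens; 'pizza' is silently dropped.
known_tokens = {
    token: int(new_counts[idx])
    for token, idx in vectorizer.vocabulary_.items()
    if new_counts[idx] > 0
}
print(known_tokens)
```

Only `dogs` and `love` survive the transformation, because `transform` uses the dictionary learned during `fit` and never extends it.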

# Let's train a classification model on that:

In [93]:

nb = MultinomialNB()
y_train=cats_vs_dogs['love']
nb.fit(X_train_counts, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

### Let's see what the model learned

In [101]:
tokens = pd.DataFrame({'token': count_vect.get_feature_names_out(),
                       'hate': nb.feature_count_[0, :],
                       'love': nb.feature_count_[1, :]}).set_index('token')
tokens

token,hate,love
are,0.0,1.0
ass,1.0,0.0
cat,1.0,0.0
cats,1.0,3.0
cute,0.0,1.0
dogs,1.0,0.0
dumb,1.0,0.0
hate,1.0,0.0
love,0.0,1.0
prefer,1.0,0.0


#### Let's see how many examples are in each class

In [102]:
nb.class_count_

array([ 2.,  2.])

In [105]:
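# convert the hate and love token counts into smoothed frequencies (the +1 avoids zero counts);
# computed from nb.feature_count_ so that re-running the cell gives the same result
tokens['hate'] = (nb.feature_count_[0, :] + 1) / nb.class_count_[0]
tokens['love'] = (nb.feature_count_[1, :] + 1) / nb.class_count_[1]
tokens['love_to_hate_ratio'] = tokens.love / tokens.hate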
tokens

token,hate,love,love_to_hate_ratio
are,0.5,1.0,2.0
ass,1.0,0.5,0.5
cat,1.0,0.5,0.5
cats,1.0,2.0,2.0
cute,0.5,1.0,2.0
dogs,1.0,0.5,0.5
dumb,1.0,0.5,0.5
hate,1.0,0.5,0.5
love,0.5,1.0,2.0
prefer,1.0,0.5,0.5
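
The ratio table above is a rough illustration that divides by the number of documents per class. As a cross-check, the sketch below recomputes the smoothed per-class token probabilities the way `MultinomialNB` with `alpha=1.0` estimates them (token count plus alpha, divided by the class's total token count plus alpha times the vocabulary size) and compares them with the fitted model's `feature_log_prob_`. The variable names are chosen for this example only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "I love cats, cats are cute",
    "Cats! so sweeeet",
    "Cats? Hate them, prefer dogs",
    "Dumb ass cat",
]
labels = [1, 1, 0, 0]

X = CountVectorizer().fit_transform(texts)
model = MultinomialNB(alpha=1.0).fit(X, labels)

counts = model.feature_count_  # shape: (n_classes, n_tokens)
vocab_size = counts.shape[1]

# Laplace (add-one) smoothing: no token ends up with probability zero.
manual_prob = (counts + 1.0) / (counts.sum(axis=1, keepdims=True) + 1.0 * vocab_size)

matches_model = np.allclose(manual_prob, np.exp(model.feature_log_prob_))
print(matches_model)
```

Each row of `manual_prob` sums to 1, so it can be read as a probability distribution over the dictionary for one class.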


### Let's see what the predictions are

In [107]:
cats_vs_dogs['prediction'] = nb.predict(X_train_counts)
cats_vs_dogs

Unnamed: 0,love,text,prediction
0,1,"I love cats, cats are cute",1
1,1,Cats! so sweeeet,1
2,0,"Cats? Hate them, prefer dogs",0
3,0,Dumb ass cat,0
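
The predictions above are made on the same texts the model was trained on. A minimal sketch of applying the model to unseen text follows; the two sentences in `new_texts` are hypothetical examples, not part of the original data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "I love cats, cats are cute",
    "Cats! so sweeeet",
    "Cats? Hate them, prefer dogs",
    "Dumb ass cat",
]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# Hypothetical unseen messages.
new_texts = ["cute cats", "dumb dogs"]

# Only transform here: the dictionary learned from the training texts is reused.
X_new = vectorizer.transform(new_texts)

predictions = model.predict(X_new)
probabilities = model.predict_proba(X_new)  # one row per text, one column per class
print(predictions)
print(probabilities.round(3))
```

The important design point is that new texts go through `transform` only; calling `fit` again would build a different dictionary that no longer matches the trained model.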


## Use the above example to train a proper model for classifying the 20 Newsgroups dataset

In [108]:
# we limit ourselves to 4 categories here
selected_categories = ['alt.atheism', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast']

# Caution: this call fetches only the training subset; the test subset is loaded later
twenty_train = fetch_20newsgroups(subset='train', categories=selected_categories, shuffle=True, random_state=42)

In [109]:
twenty_train.target_names

['alt.atheism',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast']

In [110]:
a_message = twenty_train.data[0]
a_newsgroup_index = twenty_train.target[0]
a_newsgroup = twenty_train.target_names[a_newsgroup_index]
print("in newsgroup {} we have a message:\n{}".format(a_newsgroup,a_message[:300]))

in newsgroup talk.politics.mideast we have a message:
From: sera@zuma.UUCP (Serdar Argic)
Subject: X-Soviet Armenia denies the historical fact of the Turkish Genocide.
Reply-To: sera@zuma.UUCP (Serdar Argic)
Distribution: world
Lines: 61

In article <C5LxEw.9p0@panix.com> mpoly@panix.com (Michael S. Polymenakos) writes:

> Maybe with the availability o


### How many documents are in each class?

In [29]:
# map each target index to its newsgroup name, then count documents per name
target_name_for_observation = [twenty_train.target_names[i] for i in twenty_train.target]
pd.DataFrame({'target_name': target_name_for_observation, 'count': 1}).groupby('target_name').count()

target_name,count
alt.atheism,480
soc.religion.christian,599
talk.politics.guns,546
talk.politics.mideast,564


### Task 1:
#### Instantiate CountVectorizer and fit it on the train dataset
Hint: [scikit-learn text tutorial](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#tokenizing-text-with-scikit-learn)

In [125]:
#---------- to remove

count_vect = CountVectorizer()
count_vect.fit(twenty_train.data)
X_train_counts = count_vect.transform(twenty_train.data)
X_train_counts.shape

(2189, 36747)

### Task 2: Examine results
#### How many unique tokens are in your dictionary?
#### What is the ID of the token 'cats'?
#### What is the representation of 'I love cats, cats are cute'?

In [112]:
# ---------------------------------
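
As a hint for this task, here is a minimal sketch of the relevant calls, run on the toy cats corpus rather than the newsgroup data. The name `toy_vect` is made up for this example; the same attributes exist on the `count_vect` fitted in Task 1, where the numbers will of course differ.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I love cats, cats are cute",
    "Cats! so sweeeet",
    "Cats? Hate them, prefer dogs",
    "Dumb ass cat",
]
toy_vect = CountVectorizer().fit(corpus)

# vocabulary_ maps each token to its column index in the count matrix.
n_tokens = len(toy_vect.vocabulary_)
cats_id = toy_vect.vocabulary_.get("cats")  # None if the token is unknown

# transform expects an iterable of documents, hence the list.
row = toy_vect.transform(["I love cats, cats are cute"]).toarray()[0]

print(n_tokens, cats_id)
print(row)
```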




## Task 3: Train a model
#### You can use Naive Bayes (NB) or logistic regression

In [126]:
nb = MultinomialNB()
y_train=twenty_train.target
nb.fit(X_train_counts, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [127]:
metrics.accuracy_score(y_train, nb.predict(X_train_counts))

0.99588853357697582

In [128]:
metrics.confusion_matrix(y_train, nb.predict(X_train_counts))

array([[478,   2,   0,   0],
       [  0, 598,   0,   1],
       [  1,   0, 545,   0],
       [  1,   4,   0, 559]])

### Evaluate the model on the test set

In [129]:
twenty_test = fetch_20newsgroups(subset='test', categories=selected_categories, shuffle=True, random_state=42)
X_test_counts = count_vect.transform(twenty_test.data)
y_test = twenty_test.target
y_pred_class = nb.predict(X_test_counts)

In [130]:
metrics.accuracy_score(y_test, y_pred_class)

0.95058339052848317

In [131]:
metrics.confusion_matrix(y_test, y_pred_class)

array([[285,  26,   4,   4],
       [  8, 390,   0,   0],
       [  0,   2, 361,   1],
       [ 16,   6,   5, 349]])
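
To see which newsgroups are hardest, the test confusion matrix can be turned into per-class precision and recall. The sketch below copies the matrix from the output above and uses the scikit-learn convention that rows are true classes and columns are predicted classes; the variable names are chosen for this example.

```python
import numpy as np

# Test-set confusion matrix copied from the output above
# (rows: true class, columns: predicted class).
cm = np.array([
    [285,  26,   4,   4],
    [  8, 390,   0,   0],
    [  0,   2, 361,   1],
    [ 16,   6,   5, 349],
])
class_names = [
    "alt.atheism",
    "soc.religion.christian",
    "talk.politics.guns",
    "talk.politics.mideast",
]

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)     # share of each true class that was found
precision = np.diag(cm) / cm.sum(axis=0)  # share of each predicted class that was right

print(f"accuracy: {accuracy:.4f}")
for name, p, r in zip(class_names, precision, recall):
    print(f"{name:<24} precision={p:.3f} recall={r:.3f}")
```

The lowest recall belongs to `alt.atheism`, whose messages are most often mistaken for `soc.religion.christian`.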

## Task 4: Try to make the test accuracy higher
  1. Tune the way you preprocess the data
      1. TF-IDF instead of counts
      1. Stop words, common words
      1. N-grams
  1. Tune the model
  1. Change the model
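
One possible starting point, shown on the toy cats corpus so that it runs without downloading anything: a `Pipeline` that combines TF-IDF weighting, English stop-word removal and unigrams plus bigrams with a Naive Bayes classifier. The parameter values (`ngram_range=(1, 2)`, `alpha=0.1`) are arbitrary assumptions for illustration, not tuned settings, and whether they help on the newsgroup data has to be checked on the test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = [
    "I love cats, cats are cute",
    "Cats! so sweeeet",
    "Cats? Hate them, prefer dogs",
    "Dumb ass cat",
]
labels = [1, 1, 0, 0]

text_clf = Pipeline([
    # TF-IDF instead of raw counts, stop words removed, unigrams and bigrams kept.
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    # A smaller alpha means less smoothing; the value here is only a guess.
    ("clf", MultinomialNB(alpha=0.1)),
])
text_clf.fit(texts, labels)

vocabulary = text_clf.named_steps["tfidf"].vocabulary_
train_predictions = text_clf.predict(texts)
print(sorted(vocabulary))
print(train_predictions)
```

Because the pipeline takes raw text, the same object can be fitted on `twenty_train.data` and evaluated on `twenty_test.data` without transforming the data by hand.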


In [133]:
#----------------------------




# Summary

The approach presented above is the simplest way to work with text data. It has several limitations:
 1. It does not take word order or context into account
 1. It cannot capture long-range dependencies
 1. It doesn't capture similarity in meaning, e.g. 'cat' and 'kitty' are similar, but dictionary-based methods treat them as completely different tokens
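
Two of these limitations can be shown with a hand-rolled bag of words. This is a simplified sketch that splits on whitespace instead of using `CountVectorizer`; the helper `bag_of_words` and the example sentences are made up for this illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace and count tokens (a simplified tokenizer)."""
    return Counter(text.lower().split())


# Word order is lost: both sentences produce exactly the same counts.
same_bag = bag_of_words("dog bites man") == bag_of_words("man bites dog")

# Similar meaning is lost: the two sentences share no token at all.
shared_tokens = set(bag_of_words("my cat sleeps")) & set(bag_of_words("your kitty naps"))

print(same_bag)
print(shared_tokens)
```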
 
# Next part: Word embeddings

A more powerful way of representing text is via embeddings, where entities (words, sentences, documents) are represented as real-valued vectors, called 'embeddings', that capture some semantic information.
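
To give a feel for what "capturing semantic information" means, the sketch below compares word vectors with cosine similarity. The three-dimensional vectors are invented by hand for this illustration and do not come from any trained model; real embeddings typically have tens to hundreds of dimensions.

```python
import numpy as np

# Hypothetical embeddings, made up for illustration only.
toy_embeddings = {
    "cat": np.array([0.90, 0.80, 0.10]),
    "kitty": np.array([0.85, 0.75, 0.20]),
    "car": np.array([0.10, 0.20, 0.95]),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


sim_cat_kitty = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitty"])
sim_cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

print(f"cat vs kitty: {sim_cat_kitty:.3f}")
print(f"cat vs car:   {sim_cat_car:.3f}")
```

With these made-up vectors, 'cat' is much closer to 'kitty' than to 'car', which is exactly the kind of similarity a bag-of-words representation cannot express.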


## [Explore word embeddings semantics and analogies](https://lamyiowce.github.io/word2viz/)