
# Naive Bayes Language Detection Lab
---

In this lab, we’ll use Naive Bayes (and other classifiers) to auto-detect the language of a given tweet. We’ll then assess the performance of our classifier.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [2]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

In [3]:
tweets_df = pd.read_csv("./datasets/tweets_language.csv")
tweets_df.drop([tweets_df.columns[0]], axis=1, inplace=True)
# Ensure the row index is integer-typed (a default RangeIndex already is, so this is only a safeguard).
tweets_df.index = tweets_df.index.astype(int)

In [4]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9431 entries, 0 to 9430
Data columns (total 2 columns):
LANG    9409 non-null object
TEXT    9409 non-null object
dtypes: object(2)
memory usage: 147.4+ KB


In [5]:
# Note: Some of the rows above are null, so we can't use them for training.
tweets_df = tweets_df.dropna()

In [6]:
tweets_df.head()

Unnamed: 0,LANG,TEXT
0,en,The #Yolo bailout: Greece's ex-finance chief h...
1,en,Another mental Saturday night. It will be near...
2,en,Sometimes you take bedtime selfies w yer hat s...
3,en,Currently just changed my entire outfit includ...
4,en,I just like listening to @SpotifyAU's top 100 ...


### 1) Data exploration.

#### 1.A) Explore a list of tweet words that occur more than 50 times.
Plot a histogram that might be helpful.

In [7]:
# Let's use the CountVectorizer to count words for us.
cvec = CountVectorizer(strip_accents='unicode', ngram_range=(1, 1))
X_all = cvec.fit_transform(tweets_df['TEXT'])

# Complete the code.

In [48]:
# X_all.sum(axis=0) is a 1 x n_words matrix; flatten it so each word gets one row.
# Row i corresponds to the i-th entry of cvec.get_feature_names().
aaa = pd.DataFrame(np.asarray(X_all.sum(axis=0)).ravel())

In [45]:
aaa.head()

Unnamed: 0,0
0,11
1,18
2,1
3,1
4,1


In [47]:
aaa.sort_values(by = 0, ascending= False)

Unnamed: 0,0
6097,8544
12545,6093
12546,2658
22642,2222
15918,1758
7900,1219
9607,1082
10436,1032
7224,984
26861,951


In [30]:
df = pd.DataFrame(cvec.transform(tweets_df['TEXT']).toarray())


In [21]:
len(cvec.get_feature_names())


32952

In [33]:
#sns.pairplot(df)
#plt.show()
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,32942,32943,32944,32945,32946,32947,32948,32949,32950,32951
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### 1.B) Investigate the `counts` histogram.

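# Histogram of total counts for words that occur more than 50 times (uses X_all from 1.A).
counts = pd.Series(np.asarray(X_all.sum(axis=0)).ravel())
counts[counts > 50].hist(bins=50, figsize=(12, 6))
plt.show()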

#### 1.C) Try it again with stop word removal.

In [8]:
# Let's use the CountVectorizer to count words for us.
cvt = CountVectorizer(strip_accents='unicode')
X_all = cvt.fit_transform(tweets_df['TEXT'])

# Complete the code.
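A minimal sketch of stop word removal, using scikit-learn's built-in `stop_words='english'` option on made-up sentences. The variable names are illustrative. Note that the built-in list covers English only, so on a multi-language corpus it strips common words from one language and leaves the others untouched; whether that helps or hurts language detection is worth checking.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences standing in for tweets_df['TEXT'].
toy_texts = ["the cat sat on the mat", "the dog chased the cat"]

plain = CountVectorizer(strip_accents='unicode').fit(toy_texts)
no_stop = CountVectorizer(strip_accents='unicode', stop_words='english').fit(toy_texts)

# Words present in the plain vocabulary but dropped by the stop word filter.
removed = sorted(set(plain.vocabulary_) - set(no_stop.vocabulary_))
print("removed:", removed)
print("kept:", sorted(no_stop.vocabulary_))
```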

#### 1.D) Explore n-grams between two and four.
Display the top 75 n-grams with frequencies. Look at each class to see their similarities and differences.

In [9]:
# Look up the appropriate parameters.
# CountVectorizer?

#### 1.E) (Optional) Try expanding the list of stop words.
There are definitely some non-words, such as web URLs, that could be removed to help us improve the score. Identify words or tokens that don't add much value to any class. **You should also look at n-grams per language to fine-tune your preprocessing. This has the greatest potential to improve your results without tuning any model parameters.**

Using `nltk.corpus`, we can get a baseline list of stop words. Try to expand it and pass it to our vectorizer.

In [10]:
import nltk

In [11]:
from nltk.corpus import stopwords
stop = stopwords.words('english')
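A sketch of how an expanded stop list and URL stripping could be passed to the vectorizer. To keep it runnable without the NLTK corpus download, it swaps in scikit-learn's `ENGLISH_STOP_WORDS` as the baseline list; with NLTK installed, the `stop` list above can take its place. The extra tokens and the `strip_urls` helper are hypothetical choices, not part of the lab.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Stand-in for nltk's stopwords.words('english') (an assumption of this sketch).
base_stop = set(ENGLISH_STOP_WORDS)

# Hypothetical Twitter-specific tokens that carry little language signal.
extra_stop = {"rt", "via", "amp"}
stop_expanded = sorted(base_stop | extra_stop)

def strip_urls(text):
    """Lowercase the text and blank out anything that looks like a web URL."""
    return re.sub(r"https?://\S+", " ", text.lower())

# A custom preprocessor replaces the built-in lowercase/strip_accents handling,
# so lowercasing is done inside strip_urls.
vec = CountVectorizer(preprocessor=strip_urls, stop_words=stop_expanded)
vec.fit(["RT check this out http://t.co/abc123 via @nasa"])
print(sorted(vec.vocabulary_))
```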

### 2) Set up a train/test split of your data using any method you wish.
Try 70/30 to start.
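A minimal sketch of a 70/30 split with `train_test_split`, on made-up rows shaped like `tweets_df`. Passing `stratify` keeps the language proportions the same in both halves, which is an optional choice here, and `random_state=42` is an arbitrary seed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows with the same column names as tweets_df.
toy_df = pd.DataFrame({
    "LANG": ["en"] * 10 + ["es"] * 10,
    "TEXT": [f"hello number {i}" for i in range(10)]
            + [f"hola numero {i}" for i in range(10)],
})

X_train, X_test, y_train, y_test = train_test_split(
    toy_df["TEXT"], toy_df["LANG"],
    test_size=0.3,             # 70/30 split
    stratify=toy_df["LANG"],   # preserve the language mix in both sets
    random_state=42,
)
print(len(X_train), len(X_test))
print(y_test.value_counts())
```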

### 3) Set up a pipeline to vectorize and use the MultinomialNB classifier.
Use `lowercase`, `strip_accents`, `Pipeline`, and (optionally) your updated `stop_words`. Fit on the tweet text, using the `LANG` column as your response.

Fit your pipeline to your training data set, then score it.

In [12]:
# Here's the code — you can adapt it from here on out.
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('cls', MultinomialNB())
])

pipeline.fit(tweets_df["TEXT"], tweets_df["LANG"])

# Don't forget to score.

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('cls', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
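The cell above fits on all of the data, so scoring it there would only give training accuracy. A sketch of fitting on a training split and scoring on held-out data follows. The sentences are made up, `toy_pipeline` is an illustrative name, and the perfect score it reaches reflects how cleanly the toy vocabulary separates, not what to expect on real tweets.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up sentences standing in for tweets_df["TEXT"] and tweets_df["LANG"].
texts = [
    "the cat is here", "the dog is there", "where is the house",
    "the day is good", "this is the way", "the night is long",
    "el gato esta aqui", "el perro esta alla", "donde esta el libro",
    "el dia esta bueno", "este es el camino", "el cielo esta azul",
]
langs = ["en"] * 6 + ["es"] * 6

X_train, X_test, y_train, y_test = train_test_split(
    texts, langs, test_size=0.3, stratify=langs, random_state=42
)

toy_pipeline = Pipeline([
    ("vect", CountVectorizer(lowercase=True, strip_accents="unicode")),
    ("tfidf", TfidfTransformer()),
    ("cls", MultinomialNB()),
])
toy_pipeline.fit(X_train, y_train)

# For a classifier pipeline, score() reports mean accuracy.
train_score = toy_pipeline.score(X_train, y_train)
test_score = toy_pipeline.score(X_test, y_test)
print(f"train accuracy: {train_score:.3f}, test accuracy: {test_score:.3f}")
```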

#### 3.A) Swap out MultinomialNB with BernoulliNB in the pipeline.
How do they compare? Do you have a guess as to why BernoulliNB is so poor?
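One way to make the swap is `set_params`, which replaces a step by its name. The sketch below compares the two classifiers with 3-fold cross-validation on made-up sentences; the numbers it prints say nothing about the tweet data. A common explanation for BernoulliNB's weakness on text like this, not verified here, is that it models only word presence or absence and penalizes every vocabulary word missing from a document, and a short tweet is missing nearly all of a vocabulary of tens of thousands of words.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up sentences standing in for the tweet text and language labels.
texts = [
    "the cat is here", "the dog is there", "where is the house",
    "the day is good", "this is the way", "the night is long",
    "el gato esta aqui", "el perro esta alla", "donde esta el libro",
    "el dia esta bueno", "este es el camino", "el cielo esta azul",
]
langs = ["en"] * 6 + ["es"] * 6

multi_pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("cls", MultinomialNB()),
])
# Copy the pipeline and replace only the final step, addressed by its name 'cls'.
bern_pipe = clone(multi_pipe).set_params(cls=BernoulliNB())

results = {}
for name, pipe in [("MultinomialNB", multi_pipe), ("BernoulliNB", bern_pipe)]:
    scores = cross_val_score(pipe, texts, langs, cv=3)
    results[name] = scores.mean()
    print(f"{name}: mean CV accuracy {results[name]:.3f}")
```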

#### 3.B) Try logistic regression and random forests in the pipeline.
How do they compare? Recall that logistic regression is discriminative, whereas Naive Bayes is generative. Logistic regression uses optimization to fit a formula that discriminates between the classes, while Naive Bayes essentially just computes aggregate statistics. So, logistic regression should have a longer training time than Naive Bayes — but does it here? (See `%time`.)

**Note**: Logistic regression and random forests both allow you to see feature importance/coefficients. In this case, these coefficients will inform you how strongly each word indicates a language. Optionally, see if you can sort these coefficients by their values to get the strongest and weakest indicator words for languages.

#### 3.C) Also, try tweaking the parameters of `CountVectorizer` and `TfidfTransformer`.

Remove TF-IDF. Is this good or bad?
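One way to remove TF-IDF without rebuilding the pipeline is to set that step to the string `'passthrough'`, which scikit-learn pipelines treat as a no-op. The sketch below assumes a reasonably recent scikit-learn and uses made-up sentences, so the scores only show the mechanics of the comparison.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up sentences standing in for the tweet text and language labels.
texts = [
    "the cat is here", "the dog is there", "where is the house",
    "el gato esta aqui", "el perro esta alla", "donde esta el libro",
]
langs = ["en"] * 3 + ["es"] * 3

with_tfidf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("cls", MultinomialNB()),
])
# 'passthrough' disables the step, so the classifier sees raw counts.
without_tfidf = clone(with_tfidf).set_params(tfidf="passthrough")

tfidf_scores = {}
for name, pipe in [("with TF-IDF", with_tfidf), ("raw counts", without_tfidf)]:
    tfidf_scores[name] = cross_val_score(pipe, texts, langs, cv=3).mean()
    print(f"{name}: mean CV accuracy {tfidf_scores[name]:.3f}")
```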

### 4) Check your score.
For which languages does your model work best? Run a classification report for all languages. Plot ROC curves and compare the area under the curve (AUC) for particular languages (each versus all others). Do they indicate that the model performs better on some languages than on others?
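A sketch of reading per-language results out of `classification_report`. The true and predicted labels are invented for illustration; in the lab they would come from `y_test` and the pipeline's `predict` on the test text. `output_dict=True` returns the report as a dictionary, and `zero_division=0` avoids a warning for a language that is never predicted.

```python
from sklearn.metrics import classification_report

# Invented labels; replace with y_test and pipeline.predict(X_test) in the lab.
y_true = ["en", "en", "en", "es", "es", "fr"]
y_pred = ["en", "en", "es", "es", "es", "en"]

print(classification_report(y_true, y_pred, zero_division=0))

# The same report as a dict makes it easy to rank languages by recall.
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
recall_by_lang = {lang: report[lang]["recall"] for lang in sorted(set(y_true))}
print(recall_by_lang)
```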

### Revisiting: ROC/AUC.

Remember how to plot ROC curves for multiple classes using `scikitplot`.

In [13]:
import scikitplot as skplt

In [14]:
# Using your pipeline, predict the probabilities of each language.
# Then, plot the ROC curve.
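If `scikitplot` is not available, the same one-versus-rest comparison can be computed with scikit-learn alone. The sketch below is that alternative, not the `scikitplot` approach the lab refers to. It uses made-up sentences, and the perfect AUC values it produces are a property of the toy data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up train and test sentences in three languages.
train_texts = [
    "the cat is here", "the dog is there", "where is the house",
    "el gato esta aqui", "el perro esta alla", "donde esta el libro",
    "le chat est ici", "le chien est la", "ou est le livre",
]
train_langs = ["en"] * 3 + ["es"] * 3 + ["fr"] * 3
test_texts = [
    "the house is here", "where is the dog",
    "el libro esta aqui", "donde esta el perro",
    "le livre est ici", "ou est le chien",
]
test_langs = np.array(["en", "en", "es", "es", "fr", "fr"])

pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("cls", MultinomialNB()),
])
pipe.fit(train_texts, train_langs)

# One probability column per language, in the order of pipe.classes_.
probas = pipe.predict_proba(test_texts)

auc_by_lang = {}
for i, lang in enumerate(pipe.classes_):
    is_lang = (test_langs == lang).astype(int)  # this language vs. all others
    auc_by_lang[lang] = roc_auc_score(is_lang, probas[:, i])
    fpr, tpr, _ = roc_curve(is_lang, probas[:, i])  # points for plotting
print(auc_by_lang)
```

The `fpr` and `tpr` arrays for each language can be passed to `plt.plot(fpr, tpr, label=lang)` to draw the curves on one set of axes.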

### 5) Check out your baseline.

What is the chance that you'll randomly guess correctly without any modeling? Assume your input phrase's language has the same chance of appearing as the languages in your training set.
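A worked sketch of two common baselines on invented label counts. If you guess each language in proportion to how often it appears, the chance of being right is the sum of the squared class shares. If you always guess the most common language, it is the largest share. On the lab data, `tweets_df["LANG"]` would replace the made-up `langs` Series.

```python
import pandas as pd

# Invented labels: 5 English, 3 Spanish, 2 French.
langs = pd.Series(["en"] * 5 + ["es"] * 3 + ["fr"] * 2)

shares = langs.value_counts(normalize=True)

# Guessing at random in proportion to class frequency:
# 0.5**2 + 0.3**2 + 0.2**2 = 0.38
random_guess_accuracy = (shares ** 2).sum()

# Always guessing the most frequent language.
majority_accuracy = shares.max()

print(f"proportional random guess: {random_guess_accuracy:.2f}")
print(f"majority class: {majority_accuracy:.2f}")
```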

### 6) What is your model not getting right?

Check out the incorrectly classified tweets. Are there any noticeable patterns? Can you explain why many of these are incorrectly classified, given what you know about how Naive Bayes works? Pay particular attention to the recall metric. What could be done in the preprocessing steps to improve accuracy?

- Try to improve your **preprocessing first**.
- Then, try to tweak your **parameters to your model(s)**.
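A sketch of how the misclassified tweets could be pulled out for inspection. The rows and predictions are invented; in the lab, the `predicted` column would come from the pipeline's `predict` on the test text. The cross-tabulation shows which languages get confused with which, and the per-language recall shows where the model misses most often.

```python
import pandas as pd

# Invented test rows and predictions, shaped like the lab data.
results = pd.DataFrame({
    "TEXT": ["good morning", "hola amigo", "ok", "bonjour", "lol"],
    "LANG": ["en", "es", "es", "fr", "fr"],
    "predicted": ["en", "es", "en", "fr", "en"],
})

# Rows where the prediction disagrees with the true language.
errors = results[results["LANG"] != results["predicted"]]
print(errors)

# Actual vs. predicted counts, one row per true language.
confusion = pd.crosstab(
    results["LANG"], results["predicted"],
    rownames=["actual"], colnames=["predicted"],
)
print(confusion)

# Recall per language: share of each language's tweets predicted correctly.
recall = (results["LANG"] == results["predicted"]).groupby(results["LANG"]).mean()
print(recall)
```

In this invented example the errors are very short texts with no language-specific words, the kind of pattern worth looking for in the real output.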

## Additional Practice

There are two additional data sets in the directory that you can use for more practice:

- **/datasets/tweets_sentiment.csv**: Sentiment analysis.

- **/datasets/insults_train.csv**: [Kaggle data set](https://www.kaggle.com/c/detecting-insults-in-social-commentary). _Warning:_ This content is fairly provocative and contains offensive and insensitive words. However, this type of problem is common in the continuum of comment threads throughout the web.

    - Check out [this blog post](http://webmining.olariu.org/my-first-kaggle-competition-and-how-i-ranked/) by a guy who used support vector machines, a "neural network," and a ton of cleaning to place third in a Kaggle competition using this same data set. Additionally, see [this post](http://peekaboo-vision.blogspot.de/2012/09/recap-of-my-first-kaggle-competition.html) — he took sixth place and found that the best model was a simple logistic regression.

#### Where to next?

If you're interested in this type of problem, a great area to read up on is sentiment analysis. This [Kaggle data set](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data) offers an excellent opportunity for more practice. The following papers are also great for further exploration of this topic:

- [Fast and accurate sentiment classification using an enhanced Naive Bayes model](http://arxiv.org/pdf/1305.6143.pdf): *a great overview!*
- [Sarcasm detection](http://www.aclweb.org/anthology/P15-2124)
- [Making computers laugh: Investigations in automatic humor recognition](http://www.aclweb.org/anthology/H05-1067)
- [Modeling sarcasm in Twitter, a novel approach](http://www.aclweb.org/anthology/W14-2609)
- [Narcissism and lie detection](https://deepblue.lib.umich.edu/bitstream/handle/2027.42/107345/zarins.finalthesis.pdf?sequence=1): *this study's metrics are interesting*