In [None]:
import numpy, pandas
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plot
import seaborn
%matplotlib inline

# Predicting Language

One common problem on the web: because it's world-wide, there are lots of natural languages out there. If you have some text (from a discussion forum or other message), it's not always obvious **which language** you have.

In this exercise, we want to train a classifier to determine which natural language a piece of text is written in.

For this to work, we need training data: somewhere that we have text in a variety of languages, and know the language we're looking at.

Fortunately, we have that: a random sample of tweets collected over the last year with [the Twitter API](https://developer.twitter.com/en/docs/tweets/sample-realtime/overview/GET_statuse_sample.html). Tweets come with their text (what the user actually tweeted) and the language that Twitter detected for that tweet. (Since the language label is itself the output of some machine learning algorithm, it's probably not perfect, but it's good enough for us to do some training with.)

In [None]:
tweets = pandas.read_json("data/twitter-data.json.gz", orient="records", lines=True)
tweets = tweets[['text', 'lang']].dropna()

In [None]:
X = tweets['text']
y = tweets['lang']

TODO:
* Split training and testing data as we have before. 
* Create a pipeline that has tf-idf transformation and a multinomial naive Bayes classifier.
* Calculate a score (as we have before) to see what fraction of the testing data is classified correctly.
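
If it helps to see the shape of those three steps, here is a minimal sketch on a tiny made-up dataset. The `toy_*` names and the example sentences are hypothetical stand-ins for `X` and `y`, and the split parameters are arbitrary choices, not part of the exercise. It is one possible arrangement, not the expected solution.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for tweets['text'] and tweets['lang'].
toy_X = [
    "the cat is on the table", "where is the train station",
    "this is a good day", "I like the blue house",
    "el gato está en la mesa", "dónde está la estación",
    "hoy es un buen día", "me gusta la casa azul",
]
toy_y = ["en", "en", "en", "en", "es", "es", "es", "es"]

# Step 1: hold out part of the data for testing (stratified so both
# languages appear in the training set).
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0, stratify=toy_y)

# Step 2: tf-idf features feeding a multinomial naive Bayes classifier.
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_X_train, toy_y_train)

# Step 3: fraction of held-out examples labelled correctly.
toy_score = toy_model.score(toy_X_test, toy_y_test)
print(toy_score)
```

With the real data, using the names `X_train`, `X_test`, `y_train`, `y_test`, and `model` instead of the `toy_*` ones keeps the later cells in this notebook working unchanged.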

# Checking Out Your Results

### Classification Report

Have a look at the `classification_report` output: where are you succeeding and failing?
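
As a reminder of how to read that output, here is a small sketch using hand-written labels. The `demo_true` and `demo_pred` lists are invented for illustration and are not results from the tweet data. The `recall` column shows, for each true language, what fraction of its examples the predictions got right.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: lots of English, very little Italian and Dutch.
demo_true = ["en", "en", "en", "en", "it", "it", "nl"]
demo_pred = ["en", "en", "en", "en", "en", "it", "en"]

# Text table for reading; zero_division=0 silences the warning for
# classes that were never predicted.
print(classification_report(demo_true, demo_pred, zero_division=0))

# The same numbers as a dictionary, handy for picking out one class.
demo_report = classification_report(
    demo_true, demo_pred, zero_division=0, output_dict=True)
print(demo_report["nl"]["recall"])
```

In the notebook itself, the equivalent call would presumably be `classification_report(y_test, model.predict(X_test))`, assuming you kept those variable names.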

### Confusion Matrix

In [None]:
confusion = confusion_matrix(y_test, model.predict(X_test), labels=model.classes_)
seaborn.heatmap(confusion, annot=True, fmt='d', cbar=False,
            xticklabels=model.classes_, yticklabels=model.classes_, cmap='viridis')
plot.xlabel('predicted label')
plot.ylabel('true label');

### More About The Data

How are we training this model anyway?

In [None]:
tweets.groupby('lang').count()

The training data is lopsided: there are **many** English examples to train with, but very few Italian and Dutch (nl). Not surprisingly, the model doesn't have enough examples of those languages to learn to recognize them.

Go back and try with `twitter-balanced.json.gz`: that has a roughly equal number of training points for each language.

Maybe separating by words isn't the best way to get at the language? Maybe the distribution of **characters** would be better. Maybe English uses more "e" than other languages and fewer "q"? This vectorizer will separate the text by character instead of by word. Maybe it's worth trying.
```
TfidfVectorizer(token_pattern=r'(?u)\w')
```
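
To see what that `token_pattern` changes, this sketch compares the tokens produced by the default vectorizer with the ones produced by the single-character pattern. The sample string and variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# build_analyzer() returns the function the vectorizer uses to turn
# a string into tokens (lowercasing is on by default).
word_tokens = TfidfVectorizer().build_analyzer()
char_tokens = TfidfVectorizer(token_pattern=r'(?u)\w').build_analyzer()

sample = "Hello, world!"
print(word_tokens(sample))  # ['hello', 'world']
print(char_tokens(sample))  # ['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']
```

Note that `\w` matches only letters, digits, and underscores, so spaces and punctuation are dropped. `TfidfVectorizer(analyzer='char')` is another option that keeps them; whether that helps here is something to test rather than assume.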

Anything else you can think of?