## How to Build a Machine Learning App from Scratch
This is an *interactive notebook version* of an article originally published by my friend [Cameron Smith](https://camtsmith.com/about). The original tutorial can be found [here](https://camtsmith.com/articles/2017-10/naive-bayes-text-classification).

#### Introduction
This project has three steps: 
1. The first is constructing a corpus of language data.
2. The second is training and testing a language classifier model to predict categories. 
3. The third step is deploying the application to the web along with an API (not covered in this notebook).

You can find the complete source code [here](https://github.com/camoverride/language-classifier). If you’d like a sneak peek at what the application looks like in the wild, click [here](https://language-classifier-app.herokuapp.com/).

#### Building a Corpus

In order to perform language classification, a data source is needed. A good source will have a large amount of text and accurate category labels. Wikipedia seems like a great place to start: it has a well-documented API, and that API allows for language-specific querying.

We'll start off by creating a list of languages to be used (the prefixes were taken from the [List of Wikipedias](https://en.wikipedia.org/wiki/List_of_Wikipedias)):

In [1]:
LANGUAGES = ['en', 'sv', 'de', 'fr', 'nl', 'ru', 'it', 'es', 'pl', 'vi', 'pt', 'uk', 'fa', 'sco']

We're using languages with large numbers of articles (but feel free to change the list). We are also including [Scots](https://en.wikipedia.org/wiki/Scots_language), because it’s quite similar to English and will provide an interesting challenge for our algorithm.
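Each prefix doubles as the subdomain of that language's Wikipedia, which is what makes language-specific querying possible. The following is a minimal sketch of that idea. The helper name `random_articles_url` is hypothetical (it is not part of the original project); it only builds the same kind of query string that `_get_random_article_ids` assembles further down, using `urlencode` from the standard library, and makes no network request.

```python
from urllib.parse import urlencode

# a short subset of the language prefixes, just for illustration
SAMPLE_LANGUAGES = ['en', 'sv', 'sco']


def random_articles_url(language_id, number_of_articles):
    """Hypothetical helper: build the URL that asks one Wikipedia for random article ids."""
    params = {
        'format': 'json',
        'action': 'query',
        'list': 'random',
        'rnlimit': number_of_articles,
        'rnnamespace': 0,  # namespace 0 means ordinary articles only
    }
    # the language prefix becomes the subdomain
    return 'https://' + language_id + '.wikipedia.org/w/api.php?' + urlencode(params)


for language_id in SAMPLE_LANGUAGES:
    print(random_articles_url(language_id, 5))
```

Only the subdomain changes between languages; the query parameters stay the same.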

Let's not forget to import these libraries:

In [2]:
import re
import json
import requests

Next, we create two main objects:
1. **GetArticles**, which can be used to grab data from Wikipedia. It also performs basic document sanitizing, such as removing punctuation, HTML tags, and citation brackets.
2. **Database**, which creates a SQLite database with two tables (not covered in this notebook).

In [3]:
class GetArticles(object):
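    """
    Fetches random Wikipedia articles for a given language and sanitizes their text.
    The original project also exposes the public method write_articles(language_id,
    number_of_articles, db_location), which writes sanitized articles to text files in a
    specified location. That method is not included in this notebook version.
    """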
    def __init__(self):
        pass


    def _get_random_article_ids(self, language_id, number_of_articles):
        """
        Makes a request for random article ids. "rnnamespace=0" means that only articles are chosen,
        as opposed to user-talk pages or category pages. These ids are used by the _get_article_text
        function to request articles.
        """
        query = (
            'https://' + language_id
            + '.wikipedia.org/w/api.php?format=json&action=query&list=random&rnlimit='
            + str(number_of_articles) + '&rnnamespace=0'
        )

        # reads the response into a json object that can be iterated over
        data = json.loads(requests.get(query).text)

        # collects the ids from the json
        ids = []
        for article in data['query']['random']:
            ids.append(article['id'])

        return ids


    def _get_article_text(self, language_id, article_id_list):
        """
        This function takes a list of article ids and yields one tuple
        (article_title, article_text) for each article that could be fetched.
        """
        # compiled once, outside the loop, and reused for every article
        match_tag = re.compile(r'(<[^>]+>|\[\d+\]|[,.\'\"()])')

        def clean(text):
            """
            Sanitizes the document by removing HTML tags, citations, and punctuation. This
            function can also be expanded to remove headers, footers, side-bar elements, etc.
            """
            return match_tag.sub('', text)

        for idx in article_id_list:
            # the ids arrive as ints, but the response uses them as string keys
            idx = str(idx)
            query = (
                'https://' + language_id
                + '.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&pageids='
                + idx + '&redirects=true'
            )

            data = json.loads(requests.get(query).text)

            try:
                title = data['query']['pages'][idx]['title']
                text_body = data['query']['pages'][idx]['extract']
            except KeyError as error:
                # if nothing is returned for the request, skip to the next item.
                # If it is important to download a precise number of files, the
                # request can be repeated after each error to get a replacement,
                # but getting a precise number of files shouldn't matter
                print(error)
                continue

            yield title, clean(text_body)
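
To see what the sanitizing step does without making any requests, here is a small standalone sketch that reuses the same regular expression on a made-up string. The sample text is invented for illustration, and the module-level `MATCH_TAG` name is an assumption of this sketch rather than something from the original code.

```python
import re

# Same pattern as in GetArticles: HTML tags, numeric citation brackets such as [1],
# and a handful of punctuation characters.
MATCH_TAG = re.compile(r'(<[^>]+>|\[\d+\]|[,.\'\"()])')


def clean(text):
    """Remove everything the pattern matches and keep the rest untouched."""
    return MATCH_TAG.sub('', text)


sample = '<p>Hello, world.[1] (It\'s "fine")</p>'
print(clean(sample))  # Hello world Its fine
```

Note that the pattern removes every period, so a number like `3.14` would come out as `314`. For language classification on running text this is probably acceptable, but it is worth knowing.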

Let's test it with N random articles (we recommend scraping fewer than 100, but feel free to break things as part of your learning).

In [4]:
# initializing object
getdata = GetArticles()
# getting ids of N random articles
articles = getdata._get_random_article_ids('en', 10)
# fetching text
text_list = getdata._get_article_text('en', articles)
# printing titles
for title, text in text_list:
    print(title)

Palazzo del Magnifico
Thomas Batts
Hayden Pass
McKague
Umhlobo Wenene FM
Lea Salonga (album)
RAF Dishforth
Proposed National Unification Promotion Law
Yasmine Larsson
Plant Information Management System
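
Since a good corpus needs accurate category labels, the natural next move is to repeat this for every language and keep the language id next to each text. The sketch below shows one possible way to do that. The names `build_corpus` and `fake_fetch` are hypothetical and not part of the original project; `fake_fetch` stands in for the real Wikipedia requests so that the example runs offline.

```python
def build_corpus(languages, fetch_articles, articles_per_language):
    """
    Collect (language_id, text) pairs. fetch_articles is any callable that takes
    (language_id, number_of_articles) and yields (title, text) tuples.
    """
    corpus = []
    for language_id in languages:
        for _title, text in fetch_articles(language_id, articles_per_language):
            # the language id serves as the category label for this text
            corpus.append((language_id, text))
    return corpus


def fake_fetch(language_id, number_of_articles):
    """Stand-in for real requests: yields made-up (title, text) tuples."""
    for i in range(number_of_articles):
        yield 'Title %d' % i, 'text %d in %s' % (i, language_id)


corpus = build_corpus(['en', 'sco'], fake_fetch, 2)
print(len(corpus))  # 4
print(corpus[0])    # ('en', 'text 0 in en')
```

With the class from this notebook, the fetcher could be something like `lambda lang, n: getdata._get_article_text(lang, getdata._get_random_article_ids(lang, n))`. Because failed requests are skipped, the real corpus may end up with fewer texts per language than requested.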


To be continued...