# Local auto-categorization
<hr>
<div style="background: white"><img src="architecture.png" style="width:50%"/></div>
Auto-categorization can be run locally. This requires collecting data, training the model (and optionally validating it), and then using the model to predict categories for arbitrary text. As the picture shows, the application consists of two parts: one trains an auto-categorization model, and the other is a web service that can be hosted anywhere to serve auto-categorization to any client.

## Collect data
Collecting the data requires a few environment variables, which are listed in `.env.sample`.

In [1]:
import sys, os
sys.path.insert(0, '../')
%cat ../.env.sample

ENV=production

CS_TOKEN=
CS_URL=

DB_URL=soldr-dev.cdawc3jitldx.eu-west-1.redshift.amazonaws.com
DB_USER=
DB_PASSWORD=
DB_PORT=5439
DB_NAME=
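
As a rough illustration of how these variables might be consumed, the sketch below reads the database settings from a mapping such as `os.environ` and fails early when one is missing. The helper name `load_db_settings` is hypothetical and not part of the project; only the variable names come from `.env.sample`.

```python
import os
from typing import Mapping

# Variable names taken from .env.sample; the helper itself is hypothetical.
DB_KEYS = ("DB_URL", "DB_USER", "DB_PASSWORD", "DB_PORT", "DB_NAME")


def load_db_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the database settings and fail early if any are missing or empty."""
    missing = [key for key in DB_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    settings = {key: env[key] for key in DB_KEYS}
    settings["DB_PORT"] = int(settings["DB_PORT"])  # environment values are text
    return settings
```

Checking the variables up front gives a clear error message instead of a failed database connection later in the fetch step.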


### Fetching data from Mittmedia article database
The important part of this step is to save a JSON file in the right format, so that it can be used in the following training steps. The format should be as follows:
```
{
  "articles": [
    {
      "categories": ["Sport", "Ekonomi",...],
      "category_ids": [1,2,....],
      "text": "Text to categorize...",
      "lead": null,
      "headline": null
    }
  ]
}
```
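
A file in this format can be checked before training with a small validator. The following is a minimal sketch, not part of the project: the function name `validate_articles` is hypothetical, and it assumes that all five keys are mandatory and that `categories` and `category_ids` always have the same length.

```python
import json

# Keys taken from the format shown above.
ARTICLE_KEYS = {"categories", "category_ids", "text", "lead", "headline"}


def validate_articles(payload: dict) -> int:
    """Return the number of articles, raising ValueError on a malformed entry."""
    articles = payload.get("articles")
    if not isinstance(articles, list):
        raise ValueError("'articles' must be a list")
    for index, article in enumerate(articles):
        missing = ARTICLE_KEYS - article.keys()
        if missing:
            raise ValueError(f"article {index} is missing {sorted(missing)}")
        # Assumption: every category name has a matching id.
        if len(article["categories"]) != len(article["category_ids"]):
            raise ValueError(f"article {index}: categories and category_ids differ in length")
    return len(articles)


sample = json.loads("""
{"articles": [{"categories": ["Sport", "Ekonomi"], "category_ids": [1, 2],
               "text": "Text to categorize...", "lead": null, "headline": null}]}
""")
print(validate_articles(sample))
```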

In [2]:
document_data = '../learning/data/notebook_mm_articles.json'

In [3]:
from learning.mm_services import fetch_data
fetch_data.main(document_data)

Using TensorFlow backend.


defaultdict(<class 'int'>, {'Väder': 4459, 'Politik': 7076, 'Arbetsmarknad': 1424, 'Kultur & nöje': 6256, 'Vetenskap & teknologi': 525, 'Brott & straff': 9221, 'Personligt': 3769, 'Skola & utbildning': 2284, 'Ekonomi, näringsliv & finans': 17327, 'Samhälle & välfärd': 7787, 'Hälsa & sjukvård': 2752, 'Olyckor & katastrofer': 5169, 'Sport': 21551, 'Livsstil & fritid': 3727, 'Miljö': 1908})


In [4]:
{'Väder': 4459, 'Miljö': 1908, 'Livsstil & fritid': 3727, 'Politik': 7076, 'Ekonomi, näringsliv & finans': 17327, 'Kultur & nöje': 6256, 'Vetenskap & teknologi': 525, 'Olyckor & katastrofer': 5169, 'Personligt': 3769, 'Skola & utbildning': 2284, 'Brott & straff': 9221, 'Samhälle & välfärd': 7787, 'Arbetsmarknad': 1424, 'Hälsa & sjukvård': 2752, 'Sport': 21551}

{'Arbetsmarknad': 1424,
 'Brott & straff': 9221,
 'Ekonomi, näringsliv & finans': 17327,
 'Hälsa & sjukvård': 2752,
 'Kultur & nöje': 6256,
 'Livsstil & fritid': 3727,
 'Miljö': 1908,
 'Olyckor & katastrofer': 5169,
 'Personligt': 3769,
 'Politik': 7076,
 'Samhälle & välfärd': 7787,
 'Skola & utbildning': 2284,
 'Sport': 21551,
 'Vetenskap & teknologi': 525,
 'Väder': 4459}
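
Per-category counts like the ones above can be derived from the saved articles with a `Counter`. This is only a sketch: the function name `count_categories` is hypothetical, and it assumes that an article with several labels is counted once for each label, which may differ from what `fetch_data` does internally.

```python
from collections import Counter


def count_categories(articles: list) -> Counter:
    """Count how many articles carry each category label."""
    counts = Counter()
    for article in articles:
        counts.update(article["categories"])
    return counts


# Tiny made-up sample in the same shape as the "articles" list.
articles = [
    {"categories": ["Sport"]},
    {"categories": ["Sport", "Politik"]},
    {"categories": ["Väder"]},
]
print(count_categories(articles))
```

Counts like these make class imbalance visible early; in the data above, `Sport` has roughly forty times as many articles as `Vetenskap & teknologi`.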

## Train the model
This section walks through the steps needed to train the model for auto-categorization: configuration and training.

### Configuration
Training the model requires some configuration. Where the models are stored and where training data is read from, along with other model hyperparameters, can be set in a YAML file in the `config/` folder. For demonstration purposes we overwrite these settings in the notebook to allow full customization while experimenting.

In [7]:
os.environ['ENV'] = '../notebook-model-config'
import learning.config

In [11]:
learning.config.model = {
    'path': '../learning/trained-models/',
    'vec_model': {
        'name': 'gensim_models/word2vec_MM_new_category_tree.model',
        'type': 'word2vec',
        'train': True
    },
    'categorization_model': {
        'name': 'lstm-multi-categorizer-new-category-tree.model',
        'type': 'blstm',
        'model_checkpoint': False,
        'use_ner': False
    }
}
learning.config.data = {
    'path': os.path.dirname(document_data) + '/',
    'articles': os.path.basename(document_data),
    'target_categories': 'new_top_categories.txt',
    'stop_words': 'stop_words.txt'
}
learning.config.verbose = True

Training the model is done with a single function call to the module. Behind the scenes this function trains word/document vectors (if enabled in the configuration). It then filters the fetched articles so that every category label has an equal number of articles, before mapping the texts to input vectors.
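
The balancing step can be pictured as downsampling every category to the size of the rarest one. The sketch below is an illustration under simplifying assumptions, not the project's implementation: the name `balance_by_category` is hypothetical, each article is assigned only to its first label, and every article is assumed to have at least one label.

```python
import random
from collections import defaultdict


def balance_by_category(articles: list, seed: int = 0) -> list:
    """Downsample so every category keeps as many articles as the rarest one."""
    by_category = defaultdict(list)
    for article in articles:
        # Simplification: use only the first label of each article.
        by_category[article["categories"][0]].append(article)
    target = min(len(group) for group in by_category.values())
    rng = random.Random(seed)  # fixed seed keeps the selection reproducible
    balanced = []
    for group in by_category.values():
        balanced.extend(rng.sample(group, target))
    return balanced
```

With multi-label articles the real filter has to decide how a shared article counts toward each of its labels, which this sketch deliberately sidesteps.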

In [9]:
from learning import model



In [14]:
model.train_and_store_model()

Numer of articles: 16975
Counter({'Brott & straff': 1862, 'Väder': 1862, 'Personligt': 1862, 'Kultur & nöje': 1862, 'Politik': 1862, 'Ekonomi, näringsliv & finans': 1862, 'Samhälle & välfärd': 1862, 'Livsstil & fritid': 1862, 'Olyckor & katastrofer': 1862, 'Sport': 1862})
Train vec model
Saved 0 documents
Saved 1000 documents
Saved 2000 documents
Saved 3000 documents
Saved 4000 documents
Saved 5000 documents
Saved 6000 documents
Saved 7000 documents
Saved 8000 documents
Saved 9000 documents
Saved 10000 documents
Saved 11000 documents
Saved 12000 documents
Saved 13000 documents
Saved 14000 documents
Saved 15000 documents
Train categorization model
Preprocess text
Done preprocessing data
Labels:  ['Väder', 'Politik', 'Kultur & nöje', 'Brott & straff', 'Personligt', 'Ekonomi, näringsliv & finans', 'Samhälle & välfärd', 'Livsstil & fritid', 'Olyckor & katastrofer', 'Sport']
Train on 13747 samples, validate on 1528 samples
Epoch 1/1
Evaluate model
Loaded model with Väder,Politik,Kultur & nö

## Use the trained model
Now that we have a trained model, we can use it in a practical example: first load the model from disk, then take a text from any source and predict its category.

In [9]:
input_file = learning.config.model['categorization_model']['name']
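# NOTE: output_file is not defined in any cell shown in this notebook;
# it must be assigned before this cell is run.
predictor = model.Categorizer(input_file, output_file)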

In [10]:
text = "Nu börjar en fullskalig strejk för alla som är medlemmar i Hamnarbetarförbundet i Sundsvalls hamn och vid många andra hamnar i Sverige.\n\nHamnarbetarförbundets syfte med strejken är att få igenom ett rikstäckande kollektivavtal med arbetsgivarorganisationen Sveriges Hamnar.\n\n– 90 procent av allt fackligt arbete sker lokalt, det känns som en självklarhet att det ska finnas kollektivavtal för våra medlemmar, säger Henrik Henriksson.\n\nMen arbetsgivarorganisationen menar att det redan finns ett kollektivavtal tecknat med Transportarbetareförbundet och de vill erbjuda samma avtal till Hamnarbetarförbundets medlemmar.\n\n– Det finns redan ett kollektivavtal, det är inte rätt att arbetare på samma arbetsplats kan ha olika villkor, säger Björn Lyngfelt, kommunikationsdirektör på SCA.\"\"\"\n"
predictor.categorize_text([text])

[{'Brott & straff': 0.11368835717439651,
  'Ekonomi, näringsliv & finans': 0.09369084984064102,
  'Kultur & nöje': 0.10835012048482895,
  'Livsstil & fritid': 0.10954376310110092,
  'Olyckor & katastrofer': 0.10778261721134186,
  'Personligt': 0.10724756866693497,
  'Politik': 0.11925984919071198,
  'Samhälle & välfärd': 0.11229677498340607,
  'Sport': 0.09329316020011902,
  'Väder': 0.03484699875116348}]
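
The result is a list with one dictionary of category probabilities per input text. To reduce such a dictionary to a single label, one option is to pick the entry with the highest probability. The helper name `top_category` below is hypothetical; the scores are copied from the output above.

```python
def top_category(scores: dict) -> tuple:
    """Return the (name, probability) pair with the highest probability."""
    name = max(scores, key=scores.get)
    return name, scores[name]


# Scores copied from the prediction output above.
prediction = {
    'Brott & straff': 0.11368835717439651,
    'Ekonomi, näringsliv & finans': 0.09369084984064102,
    'Kultur & nöje': 0.10835012048482895,
    'Livsstil & fritid': 0.10954376310110092,
    'Olyckor & katastrofer': 0.10778261721134186,
    'Personligt': 0.10724756866693497,
    'Politik': 0.11925984919071198,
    'Samhälle & välfärd': 0.11229677498340607,
    'Sport': 0.09329316020011902,
    'Väder': 0.03484699875116348,
}
print(top_category(prediction))
```

For this output the helper selects `Politik`. Most scores sit close to 0.1, which is what a uniform guess over ten labels would give, so the top label carries little confidence after a single training epoch.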

### Use the trained model through the web service
Another way to use the model is to start the server and send a request to it.

In [11]:
%%capture
import web.app
import multiprocessing
import requests
app_process = multiprocessing.Process(target=lambda: web.app.app.run(host='0.0.0.0', port=8080))

In [12]:
app_process.start()

 * Serving Flask app "web.app" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off


 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)


In [13]:
requests.get('http://localhost:8080/ping').text

'\n'
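
Because the server starts in a separate process, a request sent immediately after `app_process.start()` may arrive before the app is listening. One way to handle this is to poll the `/ping` endpoint until it answers. The helper below is a generic sketch with a hypothetical name, `wait_until_ready`; it takes any zero-argument probe function, so it does not depend on the web framework.

```python
import time
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], attempts: int = 20, delay: float = 0.5) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for _ in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            pass  # the server is probably still starting; try again
        time.sleep(delay)
    return False
```

In this notebook the probe could be `lambda: requests.get('http://localhost:8080/ping').ok`, since `requests` raises a connection error while the port is still closed.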

Then send a request to the app with a payload structured like the following:

In [14]:
document = {
    'body': text,
    'categories2': None,
    'uuid': None
}
url = 'http://localhost:8080/invocations'
requests.post(url, data=document).text

'{"entities": [], "categories": {"Personligt": 0.10819830745458603, "Sport": 0.07992172986268997, "Kultur & n\\u00f6je": 0.11247000098228455, "Samh\\u00e4lle & v\\u00e4lf\\u00e4rd": 0.11142752319574356, "Brott & straff": 0.11764025688171387, "V\\u00e4der": 0.03385605290532112, "Ekonomi, n\\u00e4ringsliv & finans": 0.0941338986158371, "Olyckor & katastrofer": 0.12290937453508377, "Livsstil & fritid": 0.10637001693248749, "Politik": 0.11307289451360703}, "category": {"category_name": "Olyckor & katastrofer", "category_probability": 0.12290937453508377}, "classified_text": ""}'
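
The response body is a JSON string, so it needs to be parsed before the fields can be used. The sketch below works on an abbreviated copy of the response above (only three of the ten categories are kept) and reads the `category` field as well as a ranked list built from `categories`.

```python
import json

# Abbreviated copy of the response above (three of the ten categories kept).
raw = (
    '{"entities": [], '
    '"categories": {"Personligt": 0.10819830745458603, '
    '"Sport": 0.07992172986268997, '
    '"Olyckor & katastrofer": 0.12290937453508377}, '
    '"category": {"category_name": "Olyckor & katastrofer", '
    '"category_probability": 0.12290937453508377}, '
    '"classified_text": ""}'
)

result = json.loads(raw)

# The service already reports its top choice in the "category" field.
best = result["category"]
print(best["category_name"], best["category_probability"])

# Rank all returned categories from most to least probable.
ranked = sorted(result["categories"].items(), key=lambda item: item[1], reverse=True)
print(ranked)
```

With a live response object, `requests` can do the parsing directly through `response.json()` instead of `json.loads(response.text)`.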

### Stop the server

In [15]:
app_process.terminate()
app_process.join()