# HADOOP/SPARK PROJECT - MS-SIO-2019
## API MANAGER & KAFKA STREAM PRODUCER
### P.Hamy, N.Leclercq, L.Poncet

#### This Jupyter notebook implements the _producer_ side of the Kafka stream into which the data from the Transilien API is injected.

Import of the required Python packages

In [None]:
import json
import time
import logging
import requests
import xmltodict
from collections import OrderedDict
from kafka import KafkaProducer 

Imports of our local tools:
- Task is a class implementing an active object model of the "thread + message queue" kind. Here we use its ability to periodically execute a user-defined action (i.e. an asynchronous activity). In our case, the action queries the SNCF API and injects the returned data into a Kafka stream.
- NotebookCellContent attaches the "current cell" of the Jupyter notebook to the object inheriting from this class. Its main benefit is to route asynchronous logging to that cell, whichever cell is active (i.e. whichever cell the user is working in). The link between the Python object and the notebook cell is established when the object is instantiated, which is why the target cell is the one in which the object is created.
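
The "thread + message queue" model described above can be sketched with the standard library alone. This is only an illustration, not the code of `tools.task`: the class name `PeriodicWorker`, its methods and the 0.05 s period are assumptions made for this example.

```python
import queue
import threading
import time


class PeriodicWorker:
    """Hypothetical stand-in for Task: one thread, one message queue, one periodic action."""

    def __init__(self, action, period_s):
        self._action = action            # user-defined callable run at every period
        self._period_s = period_s        # delay between two runs, in seconds
        self._messages = queue.Queue()   # mailbox read by the worker thread
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def exit(self):
        # post an 'exit' message, then wait for the thread to return
        self._messages.put("exit")
        self._thread.join()

    def _run(self):
        while True:
            try:
                # wait for a message for at most one period
                message = self._messages.get(timeout=self._period_s)
            except queue.Empty:
                # no message received within the period: run the periodic action
                self._action()
                continue
            if message == "exit":
                break


ticks = []
worker = PeriodicWorker(lambda: ticks.append(time.time()), period_s=0.05)
worker.start()
time.sleep(0.3)
worker.exit()
print(f"periodic action executed {len(ticks)} times")
```

Using the queue timeout as the period keeps the thread responsive: an `exit` message is handled as soon as it arrives instead of waiting for the end of a sleep.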

In [None]:
from tools.task import Task
from tools.logging import NotebookCellContent

Basic configuration of [logging](https://docs.python.org/3/library/logging.html#logging.basicConfig)

In [None]:
logging.basicConfig(format='%(asctime)s %(levelname)s:%(message)s', level=logging.ERROR, datefmt='%H:%M:%S')

The login details for the Transilien API are loaded from a local file named 'api_transilien_login.json' containing the following dictionary (note that the keys as well as the login and password values must be enclosed in double quotes, as JSON requires for strings):
```
{
    "login": "xxxxxx",
    "password": "zzzzzz"
}
```
To work around the quota on the SNCF API, which is very low by default, several accounts can be used. Simply specify them as follows in the file:
```
{
    "logins": ["xxx1xxx", "xxx2xxx", "xxx3xxx"],
    "passwords": ["zzz1zzz", "zzz2zzz", "zzz3zzz"]
}
```
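
Both file layouts can be reduced to a pair of lists before use. The helper below is a minimal sketch under that assumption; the function name `normalize_credentials` and the length check are not part of the notebook.

```python
import json


def normalize_credentials(credentials):
    """Return (logins, passwords) lists from either supported file layout."""
    if 'logins' in credentials:
        logins, passwords = credentials['logins'], credentials['passwords']
    else:
        logins, passwords = [credentials['login']], [credentials['password']]
    # each login needs its matching password at the same index
    if len(logins) != len(passwords):
        raise ValueError("logins and passwords must have the same length")
    return logins, passwords


# placeholder contents, parsed the same way json.load would parse the file
single = json.loads('{"login": "xxxxxx", "password": "zzzzzz"}')
multi = json.loads('{"logins": ["xxx1xxx", "xxx2xxx"], "passwords": ["zzz1zzz", "zzz2zzz"]}')
print(normalize_credentials(single))
print(normalize_credentials(multi))
```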

The following cell creates this file. Simply uncomment the code and run it (remember to change the logins and passwords :-) otherwise just run it again: the 'w+' mode overwrites the existing file).

In [None]:
"""
credentials = {'logins': ["xxx1xxx", "xxx2xxx", "xxx3xxx"], 'passwords': ["zzz1zzz", "zzz2zzz", "zzz3zzz"]}
with open('./api_transilien_login.json', 'w+', encoding='utf-8') as f:
    json.dump(credentials, f)
"""

**TransilienApi**: interface class for the SNCF API 
This class handles the requests to the SNCF API and the conversion of the data (_xml_) to the desired format. The code is fully commented. Nothing special here apart from the following two points: 
- to cope with the quota on the SNCF API, we use a circular iterator over the stations and over the login/password pairs. See TransilienApi.next_login, next_password and next_station.
- data conversion is (optionally) delegated to an instance of the Converter class (see further down in this notebook)
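
The circular iteration can also be expressed with `itertools.cycle`. The sketch below is an alternative to the hand-written iterators of the class, not a copy of them: it pairs each login with its password using `zip`, and the account and station values are placeholders.

```python
from itertools import cycle

# placeholder accounts and station codes, for illustration only
logins = ["xxx1xxx", "xxx2xxx"]
passwords = ["zzz1zzz", "zzz2zzz"]
stations = ["station_a", "station_b", "station_c"]

# endless iterators: they wrap around to the first item once exhausted
credentials_cycle = cycle(zip(logins, passwords))
stations_cycle = cycle(stations)


def next_batch():
    """Plan one polling round: one request per account, each on the next station."""
    batch = []
    for _ in range(len(logins)):
        login, password = next(credentials_cycle)
        batch.append((next(stations_cycle), login, password))
    return batch


first_round = next_batch()
second_round = next_batch()
print(first_round)
print(second_round)
```

With two accounts and three stations, the second round starts on the third station and then wraps around to the first one, so every station is eventually polled by every account.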

In [None]:
class TransilienApi(NotebookCellContent):
    
    # -------------------------------------------------------------------------------
    def __init__(self, credentials_file, managed_stations=None, owner=None):
    # -------------------------------------------------------------------------------
        # this TransilienApi instance shares the same notebook cell with its owner (log_output)
        NotebookCellContent.__init__(self, "KafkaProducerTask", parent=owner)
        # the list of stations for which we want to retrieve departure data
        self.managed_stations = managed_stations
        # dictionary of stations in which the key is the station code
        self.stations_by_code = None
        # optional xml -> ? data converter
        self.converter = None
        # circular stations iterator: returns the next station to use in the API request
        self.stations_iterator = None
        # list of logins
        self.logins = None
        # circular login iterator: returns the next login to use in the API request
        self.logins_iterator = None
        # list of passwords
        self.passwords = None
        # circular password iterator: returns the next password to use in the API request
        self.passwords_iterator = None
        # load the lists of logins and passwords from the specified file
        self.__load_credentials(credentials_file)

    # -------------------------------------------------------------------------------
    def __load_credentials(self, credentials_file):
    # -------------------------------------------------------------------------------
        # load the lists of logins and passwords from the specified file
        with open(credentials_file, 'r', encoding='utf-8') as f:
            self.credentials = json.load(f)
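        # accept either the multi-account keys ('logins'/'passwords') or the
        # single-account keys ('login'/'password'); the singular keys are only
        # read when the plural ones are absent (a dict.get default would be
        # evaluated eagerly and raise a KeyError on multi-account files)
        if 'logins' in self.credentials:
            self.logins = self.credentials['logins']
            self.passwords = self.credentials['passwords']
        else:
            self.logins = [self.credentials['login']]
            self.passwords = [self.credentials['password']]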
        
    # -------------------------------------------------------------------------------
    def set_converter(self, converter):
    # -------------------------------------------------------------------------------
        # attach the specified converter to the TransilienApi instance
        assert(isinstance(converter, Converter))
        self.converter = converter
       
    # -------------------------------------------------------------------------------
    def set_managed_stations(self, managed_stations):
    # -------------------------------------------------------------------------------
        # attach the list of stations for which we want to retrieve departure data
        if managed_stations == '*':
            self.managed_stations = list(self.stations_by_code.keys())
        else:
            assert(isinstance(managed_stations, list))
            assert(len(managed_stations))
            self.managed_stations = managed_stations
        
    # -------------------------------------------------------------------------------
    def load_stations_data(self, fullpath):
    # -------------------------------------------------------------------------------
        # load stations data from the specified json file
        with open(fullpath, "r", encoding='utf-8') as f:
            tmp = json.load(f)
        self.stations_by_code = OrderedDict(sorted(tmp.items(), key=lambda x: x[0]))
    
    # -------------------------------------------------------------------------------
    def __next_login(self):
    # -------------------------------------------------------------------------------
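        # circular login iterator: returns the next login to use in the API request
        try:
            return next(self.logins_iterator)
        except (TypeError, StopIteration):
            # iterator not created yet (None) or exhausted: restart from the first login
            self.logins_iterator = iter(self.logins)
            return next(self.logins_iterator)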
      
    # -------------------------------------------------------------------------------
    def __next_password(self):
    # -------------------------------------------------------------------------------
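        # circular password iterator: returns the next password to use in the API request
        try:
            return next(self.passwords_iterator)
        except (TypeError, StopIteration):
            # iterator not created yet (None) or exhausted: restart from the first password
            self.passwords_iterator = iter(self.passwords)
            return next(self.passwords_iterator)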
        
    # -------------------------------------------------------------------------------
    def __next_station(self):
    # -------------------------------------------------------------------------------
        # circular stations iterator: returns the next station to use in the API request
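        try:
            return next(self.stations_iterator)
        except (TypeError, StopIteration):
            # iterator not created yet (None) or exhausted: restart from the first station
            self.stations_iterator = iter(self.managed_stations)
            return next(self.stations_iterator)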
        
    # -------------------------------------------------------------------------------
    def poll_next_station_data(self):
    # -------------------------------------------------------------------------------
        # execute some requests on the SNCF API  
        # here we fight against the very poor quota we have on the API: the idea is to 
        # execute N consecutive requests using a different (login,password) and station 
        # identifier for each request - N is simply the number of (login,password) we have 
        assert(self.managed_stations is not None)
        assert(self.stations_by_code is not None)
        trains_data = {}
        for i in range(len(self.logins)):
            station = self.__next_station()
            self.debug(f"API: polling data from {self.stations_by_code[station]['label']}...")
            url = f"https://api.transilien.com/gare/{station}/depart"
            response = requests.get(url, auth=(self.__next_login(), self.__next_password()))
            self.debug(f"API: request response {response}")
            # per-station basic/generic data extraction from xml response (whatever the response is!)
            trains_data[station] = self.__parse_xml_data(station, response.content)
        # optional data conversion
        if self.converter is not None:
            trains_data = self.converter.convert(trains_data)
        return trains_data
    
    # -------------------------------------------------------------------------------
    def __parse_xml_data(self, station, xml_response):
    # -------------------------------------------------------------------------------
        # basic/generic data extraction from xml response (whatever xml_response is!)
        # schema of the dictionary we return
        station_trains_data = {
            'station': {
                'code':station, 
                'label':self.stations_by_code[station]['label'], 
                'latitude':self.stations_by_code[station]['latitude'], 
                'longitude':self.stations_by_code[station]['longitude']}, 
            'departures':[]
        }
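        # here, we make use of xmltodict:
        # basically, each xml <tag></tag> becomes a dict 'key'.
        try:
            xml2dict = xmltodict.parse(xml_response)
            xml2dict_trains = xml2dict['passages']['train']
            # xmltodict returns a single dict (not a list) when there is only one <train>
            if not isinstance(xml2dict_trains, list):
                xml2dict_trains = [xml2dict_trains]
            for entry in xml2dict_trains: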
                # filter trains: we reject trains having a terminus which is not one of 
                # our list of registered stations (i.e. self.stations_by_code)
                terminus_code = entry.get('term', None)
                if terminus_code is None or terminus_code not in self.stations_by_code:
                    continue
                # format train data so that it will be easier to handle by the converter
                train = {}
                train['date'] = entry['date']['#text'].split(' ')[0]
                train['time'] = entry['date']['#text'].split(' ')[1]
                train['number'] = entry['num']
                train['mission'] = entry['miss']
                train['mode'] = entry['date']['@mode']
                train['terminus'] = {'code':terminus_code, 'label':self.stations_by_code[terminus_code]['label']}
                # append the train to the departures list of the specified station
                station_trains_data['departures'].append(train)
        except Exception as e:
            # we don't want to deal with xml-response type
            # simply return an empty list of departures in case the parsing failed
            # self.error(e)
            station_trains_data['departures'] = []
        return station_trains_data

We define a **Converter** class whose child classes are responsible for converting the data preformatted by TransilienApi to various formats (e.g. _json_, _google-protobuf_, ...). 

Note: this design is historical. We tested several formats and eventually kept the simplest and most natural one (in terms of processing on the Spark side): _json_.

In [None]:
class Converter:
    # convert: the member function the child classes have to override
    def convert(self, trains_data):
        raise NotImplementedError("Converter.convert: default impl. called!")

**JsonConverter** is a **Converter** dedicated to the _json_ format...

In [None]:
class JsonConverter(Converter):
    # convert: TransilienApi generic format to json...
    def convert(self, trains_data):
        assert(isinstance(trains_data, (dict, OrderedDict)))
        departures = []
        for station, station_data in trains_data.items():
            for train_data in station_data['departures']:
                # split departure in date & time 
                # note since we will use 'unix timestamps' (seconds since 1.1.1970) 
                # we don't have to deal with day transition between two departures
                # of the same train (e.g. departure @23H59 & next one @00H05). the
                # year-month-day part is consequently almost useless in our case. 
                time = f"{train_data['time']}"
                date = '-'.join(train_data['date'].split('/')[::-1])
                timestamp = f"{date}T{time}:00.000Z"
                departure = {
                    # station identifier (number)
                    'station':int(station), 
                    # train identifier (string)
                    'train': train_data['number'], 
                    # departure time (string)
                    'timestamp':timestamp,
                    # departure mode (string) 
                    'mode':train_data['mode'],
                    # mission code (string)
                    'mission':train_data['mission'],
                    # terminus (i.e. station) identifier (number)
                    'terminus':int(train_data['terminus']['code'])
                } 
                departures.append(departure)
        return departures      
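
The timestamp string built in `convert` can be checked in isolation. The sketch below assumes, as the parsing code suggests, that the API returns dates in the form 'dd/mm/yyyy HH:MM'; the helper name `to_timestamp` and the sample value are made up for this example.

```python
def to_timestamp(date, time_of_day):
    """Build the timestamp string from a 'dd/mm/yyyy' date and an 'HH:MM' time."""
    # reverse day/month/year into year-month-day
    iso_date = '-'.join(date.split('/')[::-1])
    return f"{iso_date}T{time_of_day}:00.000Z"


raw = "25/12/2019 23:59"  # made-up sample, not actual API output
date, time_of_day = raw.split(' ')
timestamp = to_timestamp(date, time_of_day)
print(timestamp)  # 2019-12-25T23:59:00.000Z
```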

**KafkaProducerTask**, a Task dedicated to the asynchronous activity of our producer. 

As mentioned above, Task (from which KafkaProducerTask inherits) implements an active object model of the "thread + message queue" kind. Here we use its ability to periodically execute a user-defined action (i.e. an asynchronous activity). In our case, the action queries the SNCF API and injects the returned data into a Kafka stream.
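
Each departure is sent to Kafka as UTF-8 encoded JSON, which is what the `value_serializer` given to the producer does. The round trip below is a sketch that needs no broker; the `deserialize` function stands for what a consumer would presumably do, and the record values are made up.

```python
import json


def serialize(value):
    # same transformation as the value_serializer passed to KafkaProducer
    return json.dumps(value).encode('utf-8')


def deserialize(payload):
    # assumed consumer-side counterpart (not part of this notebook)
    return json.loads(payload.decode('utf-8'))


# made-up record following the keys produced by JsonConverter
departure = {
    'station': 1,
    'train': 'T001',
    'timestamp': '2019-12-25T23:59:00.000Z',
    'mode': 'X',
    'mission': 'ABCD',
    'terminus': 2,
}
payload = serialize(departure)
print(type(payload).__name__, len(payload))
print(deserialize(payload) == departure)
```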

In [None]:
class KafkaProducerTask(Task, NotebookCellContent):

    # -------------------------------------------------------------------------------
    def __init__(self, config):
    # -------------------------------------------------------------------------------
        # init the Task part of our instance
        Task.__init__(self, "KafkaProducerTask")
        # init the NotebookCellContent part of our instance
        NotebookCellContent.__init__(self, "KafkaProducerTask")
        # store our configuration parameters
        self.config = config
        # set the logging level to the one specified (or default to logging.DEBUG)
        self.set_logging_level(self.config.get('logging_level', logging.DEBUG))
        self.debug("TSP:initializing...")
        # our kafka producer
        self.producer = None
        # setup API data polling
        self.__setup_api()
        self.debug("TSP:`-> done!")
    
    # -------------------------------------------------------------------------------
    def __setup_api(self):
    # -------------------------------------------------------------------------------
        # setup the SNCF API
        # path to the credentials file
        credentials_file = self.config.get('credentials', './api_transilien_login.json')
        # instantiate our TransilienApi manager (share same cell for logging)
        self.api = TransilienApi(credentials_file, owner=self)
        # attach a JsonConverter to the TransilienApi manager
        self.api.set_converter(JsonConverter())
        # attach stations data to the TransilienApi manager (register potentially managed stations)
        self.api.load_stations_data("./transilien_line_l_stations_by_code.json")
        # tell the TransilienApi manager which stations we manage ('*' = all registered ones)
        self.api.set_managed_stations('*')
        
    # -------------------------------------------------------------------------------
    def on_init(self):
    # -------------------------------------------------------------------------------
        # this function is called when the Task starts and is executed in the context 
        # of the associated thread - it provides us with a chance to perform some 
        # initialization actions - like instantiating the KafkaProducer:
        self.debug("KafkaProducerTask: initializing KafkaProducer instance...")
        self.producer = KafkaProducer(
                            client_id='transilien-producer-01',
                            bootstrap_servers = self.config.get('bootstrap_servers', ['sandbox-hdp.hortonworks.com:6667']),
                            value_serializer=lambda v: json.dumps(v).encode('utf-8'),
                            acks=0,
                            api_version=(0, 10, 1)
                        )
        self.debug("`-> done!")
        # force call to handle_periodic_message (trick to force first data update)
        self.handle_periodic_message()
        # then setup ourself to poll data from the SNCF API at a given period (in seconds)
        p = self.config.get('api_polling_period_in_seconds', 2.)
        self.enable_periodic_message(p)
        
    # -------------------------------------------------------------------------------
    def on_exit(self):
    # -------------------------------------------------------------------------------
        # this function is called when the Task exits and is executed in the context 
        # of the associated thread - it provides us with a chance to perform some 
        # termination actions - like closing the KafkaProducer:
        self.debug("KafkaProducerTask: closing the KafkaProducer instance...")
        self.producer.close()
        self.debug("`-> done!")
            
    # -------------------------------------------------------------------------------
    def handle_periodic_message(self):
    # -------------------------------------------------------------------------------
        # in case 'enable_periodic_message' has been previously invoked on the Task 
        # instance, this function will be called at the specified period - providing
        # us with the ability to execute a periodic and asynchronous job in the context
        # of the associated thread - a job like polling an API and pushing the data
        # into a kafka stream...
        try:
            # clear cell content (avoid accumulating too much log in the notebook cell)
            self.clear_output()
            # do the job...
            self.debug("KafkaProducerTask: polling data from the SNCF API...")
            t = time.time()
            # get some data from the API
            departures = self.api.poll_next_station_data()
            self.debug(f"KafkaProducerTask: obtained {len(departures)} train entries in {round(time.time() - t, 2)} s")
            self.debug(f"KafkaProducerTask: injecting data into the Kafka topic '{self.config['topic']}'...")
            t = time.time()
            # inject the data into the kafka stream
            for departure in departures:
                try:
                    # we do that train by train... 
                    self.producer.send(self.config['topic'], departure)
                except Exception as e:
                    # route the error to the notebook cell like every other log message
                    self.error(e)
            self.debug(f"`-> took {round(time.time() - t, 2)} s")
        except Exception as e:
            self.error(e)

Configuration of our KafkaProducerTask

In [None]:
producer_task_config = {
    # logging level
    'logging_level': logging.DEBUG,
    # path to credentials file
    'credentials': './api_transilien_login.json',
    # kafka servers
    'bootstrap_servers': ['sandbox-hdp.hortonworks.com:6667'], 
    # kafka topic
    'topic': 'transilien-02',
    # API polling period (in seconds)
    'api_polling_period_in_seconds':2.
}

Instantiation and start of our KafkaProducerTask...

In [None]:
producer_task = KafkaProducerTask(producer_task_config)
producer_task.start_asynchronously()

Stop (and destruction) of our KafkaProducerTask.

Note: a call to _exit_ asks the underlying thread to return, which makes it 'disappear'. There is no way to restart a Task on which _exit_ has been invoked. The object must be rebuilt. 
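
This restriction matches the behaviour of Python threads in general. Assuming that Task relies on a standard `threading.Thread` (the code of `tools.task` is not shown here), the sketch below illustrates why a new object has to be built.

```python
import threading

# a thread whose target returns immediately
worker = threading.Thread(target=lambda: None)
worker.start()
worker.join()

try:
    # a Thread object can only be started once
    worker.start()
    restarted = True
except RuntimeError as error:
    restarted = False
    print(f"cannot restart: {error}")

# the only option is to build a new object
new_worker = threading.Thread(target=lambda: None)
new_worker.start()
new_worker.join()
```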

In [None]:
producer_task.exit()