## ARES activities report

In this workbook we will use the tools acquired in the probability and statistics module to study two situations based on a small sample of the activity log system (ARES).

In [123]:
# Imports
from math import log
import itertools
import unidecode
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import random

### Exploratory analysis of the data

In [None]:
df = pd.read_csv("ARES.csv")
df.head()

In [None]:
df.shape

In [None]:
df["CODIGO_ETAPA"].nunique()

In [None]:
df["CODIGO_ETAPA"].value_counts()

In [None]:
df.describe()

### Selected stages and justification

To work with as much data as possible, we select the ten stages with the most observations.

In [None]:
df["CODIGO_ETAPA"].value_counts().head(10)

In [None]:
lista_etapas = ["COCOD", "ERENT", "APEJE", "APSEG", "PRSIS", "ASEJE", "COAJU", "EMPALME", "COREV", "ASSEG"]

filtered_df = df[df['CODIGO_ETAPA'].isin(lista_etapas)]
filtered_df.describe()

Notice that the reduced data frame shows summary statistics very similar to those of the full data frame.

### Part 1: correlation hypothesis (language model using description)

#### Model development

To try to establish a correlation between a stage and its description, we train one bigram model per selected stage. For each stage we use the descriptions of its activities to compute the frequencies of letters and of pairs of letters, and we then predict the stage of a new description by maximum likelihood, working with the logarithm of the likelihood.
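
In symbols (the notation is ours and only meant as a reading aid): writing a cleaned description as the letter sequence $c_1 c_2 \dots c_n$ and $\hat{P}_s$ for the frequencies estimated from the training descriptions of stage $s$, the predicted stage is the one with the largest log-likelihood:

$$
\hat{s} = \arg\max_{s} \sum_{i=1}^{n-1} \log \hat{P}_s(c_{i+1} \mid c_i),
\qquad
\hat{P}_s(c_{i+1} \mid c_i) = \frac{\hat{P}_s(c_i c_{i+1})}{\hat{P}_s(c_i)}.
$$

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long descriptions.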

We developed functions to take all descriptions and remove unwanted symbols.

In [None]:
en_alphabet = list(range(97, 123))
en_alphabet_chr = [chr(code) for code in en_alphabet]
all_caracters = list(range(256))
non_alphabetic_en = [simbol for simbol in all_caracters if simbol not in en_alphabet] 

def clean_string(string:str)->str: # Take strings and clean the characters
    string = string.lower()
    string = unidecode.unidecode(string)
    for code in non_alphabetic_en:
        string = string.replace(chr(code), "")
    return string

def get_descriptions(stage:str, database:pd.DataFrame)->str:
    # Receives the stage name, extracts all the descriptions associated
    # with that stage and returns them joined in a single string
    total = database[database["CODIGO_ETAPA"] == stage]["DESCRIPCION"]
    return "".join(total)
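
To make the cleaning step concrete, here is a minimal, self-contained sketch that relies only on the standard library. It is a stand-in under stated assumptions, not the function used in this workbook: `strip_to_ascii_letters` is a hypothetical name, and `unicodedata.normalize` replaces `unidecode`, so characters that `unidecode` transliterates (for example `ß`) may be handled differently.

```python
import unicodedata

def strip_to_ascii_letters(text: str) -> str:
    """Lowercase, strip accents, and keep only the letters a-z."""
    # NFKD splits an accented letter into base letter + combining mark
    decomposed = unicodedata.normalize("NFKD", text.lower())
    # Keep only ASCII lowercase letters; digits, spaces and marks are dropped
    return "".join(ch for ch in decomposed if "a" <= ch <= "z")

print(strip_to_ascii_letters("Revisión de código #12"))  # revisiondecodigo
```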

Next, we define functions that compute the relative frequencies of letters and of pairs of letters.

In [None]:
all_pairs = list(itertools.product(en_alphabet_chr, en_alphabet_chr))
pairs = [x+y for x,y in all_pairs] # list of all letter combinations

def count_pairs(string:str)->dict:
    resume = {pair:0 for pair in pairs}
    parts = []
    for i in range(len(string)-1):
        parts.append(string[i:i+2])
    set_parts = set(parts)
    for pair in set_parts:
        resume[pair] = parts.count(pair)/len(parts)
    return resume

def count_letter(string:str)->dict:
    resume = {letter:0 for letter in en_alphabet_chr}
    if len(string) != 0:
        for letter in en_alphabet_chr:
            resume[letter] = string.count(letter)/len(string)
    return resume

We then compute a joint probability for each pair regardless of order, by adding the frequency of a pair to the frequency of its reverse.

In [None]:
def joint_prob(model_bigram):
    joint = dict()
    keys_list = list(model_bigram.keys())
    for item in keys_list:
        if item[0] != item[1]:
            joint[item] = (model_bigram[item] + model_bigram[item[::-1]])
        else:
            joint[item] = model_bigram[item]
    return joint

We process the database to select the training set and the test set

In [None]:
df = pd.read_csv("ARES.csv")[["CODIGO_ETAPA", "DESCRIPCION"]]
train = pd.DataFrame()
test = pd.DataFrame()
stages_dataframe = pd.DataFrame()
stages = ["COCOD", "ERENT", "APEJE", "APSEG", "PRSIS", "ASEJE", "COAJU", "EMPALME", "COREV", "ASSEG"]
for stage in stages:
    sub_frame = df[df["CODIGO_ETAPA"] == stage] 
    stages_dataframe = pd.concat([stages_dataframe, sub_frame])
    t = sub_frame.sample(n=800, random_state=1)
    train = pd.concat([train, t])
test = stages_dataframe.drop(train.index)

We train the 10 models

In [None]:
def probabilities(stage:str, train=train):
    descriptions = clean_string(get_descriptions(stage, train))
    return joint_prob(count_pairs(descriptions)), count_letter(descriptions)

COCOD_prob_pair, COCOD_prob_letter = probabilities("COCOD")
ERENT_prob_pair, ERENT_prob_letter = probabilities("ERENT")
APEJE_prob_pair, APEJE_prob_letter = probabilities("APEJE")
APSEG_prob_pair, APSEG_prob_letter = probabilities("APSEG")
PRSIS_prob_pair, PRSIS_prob_letter = probabilities("PRSIS")
ASEJE_prob_pair, ASEJE_prob_letter = probabilities("ASEJE")
COAJU_prob_pair, COAJU_prob_letter = probabilities("COAJU")
EMPALME_prob_pair, EMPALME_prob_letter = probabilities("EMPALME")
COREV_prob_pair, COREV_prob_letter = probabilities("COREV")
ASSEG_prob_pair, ASSEG_prob_letter = probabilities("ASSEG")

#### Experiments to test performance
From the available data we used 800 samples of each stage to train the models; the remaining rows were used to test how accurately the models predict each stage.

In [None]:
def log_likelihood_contional(string:str, pair_distribution:dict, letter_distribution:dict):
    result = 0.0
    parts = []
    for i in range(len(string)-1):
        parts.append(string[i:i+2])
    for pair in parts:
        if pair_distribution[pair] != 0:
            result = result + log(pair_distribution[pair]/letter_distribution[pair[0]])
    return result

In [None]:
# results[stage] = [correct predictions, denominator used for the accuracy]
# NOTE: .size counts cells (rows x 2 columns) of the full data set, training rows
# included, so this denominator is larger than the number of test rows of the stage.
results = {stage:[0, df[df["CODIGO_ETAPA"] == stage].size] for stage in stages}
for row, description in test.iterrows():
    d = clean_string(description["DESCRIPCION"])
    COCOD = log_likelihood_contional(d, COCOD_prob_pair, COCOD_prob_letter)
    ERENT = log_likelihood_contional(d, ERENT_prob_pair, ERENT_prob_letter)
    APEJE = log_likelihood_contional(d, APEJE_prob_pair, APEJE_prob_letter)
    APSEG = log_likelihood_contional(d, APSEG_prob_pair, APSEG_prob_letter)
    PRSIS = log_likelihood_contional(d, PRSIS_prob_pair, PRSIS_prob_letter)
    ASEJE = log_likelihood_contional(d, ASEJE_prob_pair, ASEJE_prob_letter)
    COAJU = log_likelihood_contional(d, COAJU_prob_pair, COAJU_prob_letter)
    EMPALME = log_likelihood_contional(d, EMPALME_prob_pair, EMPALME_prob_letter)
    COREV = log_likelihood_contional(d, COREV_prob_pair, COREV_prob_letter)
    ASSEG = log_likelihood_contional(d, ASSEG_prob_pair, ASSEG_prob_letter)
    # The predicted stage is the one whose model gives the highest log-likelihood
    prediction = max([(COCOD, "COCOD"), (ERENT, "ERENT"), (APEJE, "APEJE"), (APSEG, "APSEG"), (PRSIS, "PRSIS"), (ASEJE, "ASEJE"), (COAJU, "COAJU"), (EMPALME, "EMPALME"), (COREV, "COREV"), (ASSEG, "ASSEG")])
    if prediction[1] == description["CODIGO_ETAPA"]:
        results[description["CODIGO_ETAPA"]][0]+=1
for key, value in results.items():
    print(f"Stage: {key}, Accuracy: {(value[0]/value[1])*100}%")
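
Per-stage accuracy is the number of correctly classified test descriptions of a stage divided by the number of test descriptions of that stage. A minimal sketch of that bookkeeping, assuming the true and predicted labels are already available as two parallel lists (`per_stage_accuracy` and the toy labels are illustrative, not taken from the ARES data):

```python
from collections import Counter

def per_stage_accuracy(true_labels, predicted_labels):
    """Return {stage: fraction of test rows of that stage predicted correctly}."""
    totals = Counter(true_labels)  # number of test rows per stage
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {stage: hits[stage] / totals[stage] for stage in totals}

# Toy labels, for illustration only
true_labels = ["COCOD", "COCOD", "PRSIS", "PRSIS", "PRSIS", "ASSEG"]
predicted = ["COCOD", "PRSIS", "PRSIS", "PRSIS", "COCOD", "COCOD"]
for stage, accuracy in per_stage_accuracy(true_labels, predicted).items():
    print(f"Stage: {stage}, Accuracy: {accuracy * 100:.2f}%")
```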

#### Results
The results of the bigram model for the 10 stages are mixed: some stages reach a comparatively high accuracy (e.g. PRSIS with 17.34%) while others stay low (e.g. ASSEG with 3.86%). These percentages are computed against `DataFrame.size` of the full data set (two cells per row, training rows included), so they understate the share of test descriptions that are classified correctly. Even so, there is room for improvement in the overall accuracy of the model. Possible next steps are to try different models or different techniques for estimating the conditional probabilities. It may also help to gather more training data or to engineer additional features to improve the predictions.
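
As one example of a different estimation technique, add-one (Laplace) smoothing assigns a small non-zero probability to letter pairs that never appear in the training text, so they are penalised in the log-likelihood instead of being skipped. The sketch below is a suggestion on our part and was not evaluated in this workbook; `smoothed_conditional`, `log_likelihood` and the toy text are hypothetical.

```python
from collections import Counter
from math import log
import string

ALPHABET = string.ascii_lowercase

def smoothed_conditional(text: str):
    """Return a function giving the add-one estimate of P(following | previous)."""
    pair_counts = Counter(text[i:i + 2] for i in range(len(text) - 1))
    # Number of times each letter starts a pair
    first_counts = Counter(text[:-1])

    def prob(previous: str, following: str) -> float:
        # Add 1 to every pair count and len(ALPHABET) to the denominator
        return (pair_counts[previous + following] + 1) / (first_counts[previous] + len(ALPHABET))

    return prob

def log_likelihood(text: str, prob) -> float:
    """Sum of log P(c_{i+1} | c_i) over consecutive letters of `text`."""
    return sum(log(prob(text[i], text[i + 1])) for i in range(len(text) - 1))

prob = smoothed_conditional("pruebasdeprueba")  # toy training text
print(log_likelihood("prueba", prob) > log_likelihood("zzzzzz", prob))  # True
```

With smoothing, every conditional distribution still sums to 1 over the alphabet, and a description made of pairs the stage has never seen receives a low score rather than a neutral one.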

### Part 2: probability of activity duration by stage (distribution of hours per stage)
The second activity consists of finding probability distributions that fit the observed durations (in hours) of each stage.

- We analyse the shape of each histogram to select the distribution that best fits each stage.
- We use maximum likelihood estimation (MLE) to find the parameters of each distribution.
- We use the PDF and CDF of each fitted distribution to calculate probabilities for each stage.
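
For the normal and the shifted exponential families the maximum likelihood estimates have closed forms, which makes the steps above easy to follow by hand. The sketch below is illustrative and uses only the standard library; the helper names and the toy durations are assumptions, and the `fit` methods of `scipy.stats` should return matching `loc` and `scale` values up to numerical precision.

```python
from math import exp, sqrt
from statistics import NormalDist, fmean

def fit_normal(sample):
    """MLE for a normal: sample mean and (biased, 1/n) standard deviation."""
    mu = fmean(sample)
    sigma = sqrt(fmean([(x - mu) ** 2 for x in sample]))
    return mu, sigma

def fit_exponential(sample):
    """MLE for a shifted exponential: loc = minimum, scale = mean - minimum."""
    loc = min(sample)
    return loc, fmean(sample) - loc

def exponential_cdf(x, loc, scale):
    """CDF of the shifted exponential with the given loc and scale."""
    return 0.0 if x < loc else 1.0 - exp(-(x - loc) / scale)

hours = [1.0, 2.0, 2.0, 3.0, 7.0]  # toy durations, not ARES data

mu, sigma = fit_normal(hours)
normal = NormalDist(mu, sigma)
print(normal.cdf(4.0) - normal.cdf(2.0))  # P(2 <= T <= 4) under the normal fit

loc, scale = fit_exponential(hours)
print(exponential_cdf(4.0, loc, scale) - exponential_cdf(2.0, loc, scale))
```

In both cases the probability of an interval is the difference of the CDF evaluated at its two endpoints.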



The durations of each stage are stored in a NumPy array, and the arrays are kept in a dictionary keyed by stage code.

In [None]:
df = pd.read_csv("ARES.csv")[["CODIGO_ETAPA", "DURACION_HORAS"]]
stages_dataframe = pd.DataFrame() 
stages = ["COCOD", "ERENT", "APEJE", "APSEG", "PRSIS", "ASEJE", "COAJU", "EMPALME", "COREV", "ASSEG"]
data = dict()
for stage in stages:
    sub_frame = df[df["CODIGO_ETAPA"] == stage]
    array = sub_frame["DURACION_HORAS"].to_numpy()
    data[stage] = array

#### Visualization of frequencies and distribution
Below we plot the histogram of each stage together with the density of the distribution fitted to it. The parameters are estimated by maximum likelihood with the `fit` methods of `scipy.stats`.

In [None]:
plt.figure()
plt.hist(data["COREV"], density=True)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)

loc, scale = stats.expon.fit(data["COREV"])

p = stats.expon.pdf(x, loc, scale)

plt.plot(x, p, 'k', linewidth=1)
title = "COREV"
_ = plt.title(title)

In [None]:
plt.figure()
plt.hist(data["COCOD"], density=True)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)

parameters = stats.norm.fit(data["COCOD"])

p = stats.norm.pdf(x,parameters[0], parameters[1])

plt.plot(x, p, 'k', linewidth=1)
title = "COCOD"
_ = plt.title(title)

In [None]:
plt.figure()
plt.hist(data["APEJE"],density=True)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)

parameters = stats.chi.fit(data["APEJE"])

p = stats.chi.pdf(x,parameters[0],parameters[1],parameters[2])
  
plt.plot(x, p, 'k', linewidth=2)
title = "APEJE"
_ = plt.title(title)

In [None]:
plt.figure()
plt.hist(data["APSEG"], density=True)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)

loc, scale = stats.expon.fit(data["APSEG"])

p = stats.expon.pdf(x, loc, scale)
plt.plot(x, p, 'k', linewidth=2)
title = "APSEG"
plt.title(title)
_ = plt.show()

In [None]:
plt.figure()
plt.hist(data["PRSIS"], density=True)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
parameters = stats.norm.fit(data["PRSIS"])
p = stats.norm.pdf(x,parameters[0], parameters[1])
  
plt.plot(x, p, 'k', linewidth=1)
title = "PRSIS"
plt.title(title)

In [None]:
plt.figure()
plt.hist(data["ASEJE"], density=True)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
parameters = stats.norm.fit(data["ASEJE"])
p = stats.norm.pdf(x,parameters[0], parameters[1])
  
plt.plot(x, p, 'k', linewidth=1)
title = "ASEJE"
plt.title(title)

In [None]:
plt.figure()
plt.hist(data["COAJU"], density=True)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
parameters = stats.norm.fit(data["COAJU"])
p = stats.norm.pdf(x,parameters[0], parameters[1])
  
plt.plot(x, p, 'k', linewidth=1)
title = "COAJU"
plt.title(title)

In [None]:
plt.figure()
plt.hist(data["EMPALME"], density=True)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
parameters = stats.norm.fit(data["EMPALME"])
p = stats.norm.pdf(x,parameters[0], parameters[1])
  
plt.plot(x, p, 'k', linewidth=1)
title = "EMPALME"
plt.title(title)

In [None]:
plt.figure()
plt.hist(data["ASSEG"], density=True)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
parameters = stats.norm.fit(data["ASSEG"])
p = stats.norm.pdf(x,parameters[0], parameters[1])
  
plt.plot(x, p, 'k', linewidth=1)
title = "ASSEG"
plt.title(title)

In [None]:
plt.figure(2)
plt.hist(data["ERENT"],density=True)

xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)

loc, scale = stats.expon.fit(data["ERENT"])

p = stats.expon.pdf(x, loc, scale)
plt.plot(x, p, 'k', linewidth=2)
title = "ERENT"
plt.title(title)

#### Functions to calculate probabilities
The following functions take two arguments: the lower bound and the upper bound of a time interval (in hours). Each one returns the probability that the duration falls within that interval under the distribution fitted to the corresponding stage.

In [None]:
def calculate_prob_COCOD(lower:float, upper:float): 
    parameters = stats.norm.fit(data["COCOD"])
    p1 = stats.norm.cdf(lower,parameters[0], parameters[1])
    p2 = stats.norm.cdf(upper,parameters[0], parameters[1])
    return p2-p1
def calculate_prob_PRSIS(lower:float, upper:float): 
    parameters = stats.norm.fit(data["PRSIS"])
    p1 = stats.norm.cdf(lower,parameters[0], parameters[1])
    p2 = stats.norm.cdf(upper,parameters[0], parameters[1])
    return p2-p1
def calculate_prob_ASEJE(lower:float, upper:float): 
    parameters = stats.norm.fit(data["ASEJE"])
    p1 = stats.norm.cdf(lower,parameters[0], parameters[1])
    p2 = stats.norm.cdf(upper,parameters[0], parameters[1])
    return p2-p1
def calculate_prob_COAJU(lower:float, upper:float): 
    parameters = stats.norm.fit(data["COAJU"])
    p1 = stats.norm.cdf(lower,parameters[0], parameters[1])
    p2 = stats.norm.cdf(upper,parameters[0], parameters[1])
    return p2-p1
def calculate_prob_EMPALME(lower:float, upper:float): 
    parameters = stats.norm.fit(data["EMPALME"])
    p1 = stats.norm.cdf(lower,parameters[0], parameters[1])
    p2 = stats.norm.cdf(upper,parameters[0], parameters[1])
    return p2-p1
def calculate_prob_ASSEG(lower:float, upper:float): 
    parameters = stats.norm.fit(data["ASSEG"])
    p1 = stats.norm.cdf(lower,parameters[0], parameters[1])
    p2 = stats.norm.cdf(upper,parameters[0], parameters[1])
    return p2-p1
def calculate_prob_APEJE(lower:float, upper:float): 
    parameters = stats.chi.fit(data["APEJE"])
    p1 = stats.chi.cdf(lower,parameters[0],parameters[1],parameters[2])
    p2 = stats.chi.cdf(upper,parameters[0],parameters[1],parameters[2])
    return p2-p1
def calculate_prob_COREV(lower:float, upper:float): 
    parameters = stats.expon.fit(data["COREV"])
    p1 = stats.expon.cdf(lower,parameters[0],parameters[1])
    p2 = stats.expon.cdf(upper,parameters[0],parameters[1])
    return p2-p1
def calculate_prob_APSEG(lower:float, upper:float): 
    parameters = stats.expon.fit(data["APSEG"])
    p1 = stats.expon.cdf(lower,parameters[0],parameters[1])
    p2 = stats.expon.cdf(upper,parameters[0],parameters[1])
    return p2-p1
def calculate_prob_ERENT(lower:float, upper:float): 
    parameters = stats.expon.fit(data["ERENT"])
    p1 = stats.expon.cdf(lower,parameters[0],parameters[1])
    p2 = stats.expon.cdf(upper,parameters[0],parameters[1])
    return p2-p1
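
Since the ten functions differ only in the stage and in the distribution family, the same computation can be written once. A minimal sketch, assuming a hypothetical `interval_probability` helper and synthetic data in place of `data[stage]`; `fit` and `cdf` are the standard methods of `scipy.stats` continuous distributions:

```python
import numpy as np
import scipy.stats as stats

def interval_probability(dist, sample, lower, upper):
    """P(lower <= T <= upper) under `dist` fitted to `sample` by maximum likelihood."""
    params = dist.fit(sample)  # shape parameters (if any), then loc and scale
    return dist.cdf(upper, *params) - dist.cdf(lower, *params)

rng = np.random.default_rng(1)
sample = rng.exponential(scale=2.0, size=500)  # synthetic durations, not ARES data

print(interval_probability(stats.expon, sample, 0.0, 2.0))
print(interval_probability(stats.norm, sample, 0.0, 2.0))
```

In the workbook's terms, `interval_probability(stats.chi, data["APEJE"], lower, upper)` would play the role of `calculate_prob_APEJE(lower, upper)`.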

### Part 3: Trigram

We build a function that generates random text from the probabilities estimated by a trigram model trained on all the descriptions in the ARES report.

In [None]:
alphabet = list(range(97, 123))
alphabet.extend([32, 44, 46])

alphabet_chr = [chr(code) for code in alphabet]

all_caracters = list(range(256))
non_alphabetic = [simbol for simbol in all_caracters if simbol not in alphabet] 

def clean_string(string:str)->str: # Take strings and clean the characters
    string = string.lower()
    string = unidecode.unidecode(string)
    for code in non_alphabetic:
        string = string.replace(chr(code), "")
    return string

def get_descriptions(database:pd.DataFrame): # Get all the descriptions together 
    out = ""
    for index, description in database["DESCRIPCION"].items():
        out += description + " "
    return out

With these functions we build the text on which the trigram model is trained.

In [None]:
df = pd.read_csv("ARES.csv")
full_ares = clean_string(get_descriptions(df)) 

We count pairs and triples of characters

In [None]:
all_pairs = list(itertools.product(alphabet_chr, alphabet_chr))
pairs = [x+y for x,y in all_pairs]

all_triples = list(itertools.product(alphabet_chr, alphabet_chr, alphabet_chr))
triples = [x+y+z for x,y,z in all_triples]

def count_pairs(string:str)->dict:
    resume = {pair:0 for pair in pairs}
    parts = []
    for i in range(len(string)-1):
        parts.append(string[i:i+2])
    set_parts = set(parts)
    for pair in set_parts:
        resume[pair] = parts.count(pair)/len(parts)
    return resume

def count_triples(string:str)->dict:
    resume = {triple:0 for triple in triples}
    parts = []
    for i in range(len(string)-2):
        parts.append(string[i:i+3])
    set_parts = set(parts)
    for triple in set_parts:
        resume[triple] = parts.count(triple)/len(parts)
    return resume

In [40]:
pairs_prob = count_pairs(full_ares) # frequencies of ordered pairs

In [35]:
triples_prob = count_triples(full_ares) # frequencies of ordered triples (this took about 8 minutes)
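
Most of that running time very likely comes from calling `parts.count` once per distinct triple, since each call rescans the whole list. A single pass with `collections.Counter` yields the same relative frequencies for the n-grams that occur. The sketch below illustrates this on a toy string; `ngram_frequencies` is a hypothetical helper, and unseen n-grams are simply absent from the result rather than stored as zeros.

```python
from collections import Counter

def ngram_frequencies(text: str, n: int) -> dict:
    """Relative frequency of every n-gram that occurs in `text`, in one pass."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

freqs = ngram_frequencies("abcabc", 3)
print(freqs)  # {'abc': 0.5, 'bca': 0.25, 'cab': 0.25}
```

A lookup such as `freqs.get("xyz", 0)` then recovers the zero frequency of an n-gram that never occurs.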

Using these frequencies we can compute the conditional probabilities of all possible triples. This is the key to building our model.
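
In formula form (the notation is ours): for characters $a$, $b$, $c$, with $\hat{P}$ denoting the relative frequencies of pairs and triples,

$$
\hat{P}(c \mid ab) = \frac{\hat{P}(abc)}{\hat{P}(ab)},
\qquad \text{with } \hat{P}(c \mid ab) = 0 \text{ when } \hat{P}(ab) = 0 .
$$

For a context $ab$ that occurs in the text, the values $\hat{P}(c \mid ab)$ add up to approximately 1 over the alphabet, because every occurrence of $ab$, except possibly the last one in the text, is followed by exactly one character.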

In [56]:
def contional_prob(letter_1, letter_2, letter_3): # Returns P(C | AB); the arguments are A, B, C in that order
    result = 0
    if pairs_prob[letter_1+letter_2] != 0 and triples_prob[letter_1+letter_2+letter_3] !=0:
        result = round(triples_prob[letter_1+letter_2+letter_3]/pairs_prob[letter_1+letter_2], 6)
    return result

Now we define a function that draws the next character given the two previous ones. It follows the conditional probabilities by means of inverse transform sampling, which relies on a uniform random number.

In [144]:
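def select_next_letter(letter_1, letter_2):
    # Conditional weights P(letter | letter_1 letter_2) over the alphabet
    weights = np.array([contional_prob(letter_1, letter_2, letter) for letter in alphabet_chr])
    total = weights.sum()
    if total == 0:
        # Context never seen in the training text: fall back to a uniform choice
        return random.choice(alphabet_chr)

    # Normalise so that the cumulative sum ends at 1 despite rounding
    cumulative = np.cumsum(weights / total)
    selection = np.random.uniform(0, 1)
    # First letter whose cumulative probability exceeds the uniform draw
    index = min(np.searchsorted(cumulative, selection, side="right"), len(alphabet_chr) - 1)
    return alphabet_chr[index]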
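
The sampling technique used here is inverse transform sampling for a discrete distribution: draw $u$ uniformly from $[0, 1)$ and return the first outcome whose cumulative probability exceeds $u$. A minimal standalone sketch (the function name, the optional `u` argument and the toy distribution are illustrative assumptions):

```python
import random
from bisect import bisect_right
from itertools import accumulate

def sample_discrete(outcomes, probabilities, u=None):
    """Pick one outcome by inverse transform sampling."""
    cumulative = list(accumulate(probabilities))
    if u is None:
        u = random.random()  # uniform on [0, 1)
    # Scale u in case the probabilities do not sum exactly to 1
    index = bisect_right(cumulative, u * cumulative[-1])
    return outcomes[min(index, len(outcomes) - 1)]

letters = ["a", "b", "c"]
probs = [0.5, 0.3, 0.2]
print(sample_discrete(letters, probs, u=0.10))  # a
print(sample_discrete(letters, probs, u=0.65))  # b
print(sample_discrete(letters, probs, u=0.95))  # c
```

Passing `u` explicitly makes the mapping from the uniform draw to the outcome easy to check; leaving it out gives a random draw.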

Finally, we build the function that takes a length ($\geq 2$) and returns a random text generated by the trigram model.

In [166]:
def generate_text(length:int)->str:
    if length < 2:
        print("Not in the domain")
        return ""
    x, y = random.sample(alphabet_chr, 2) # The first two characters are chosen at random (and are distinct)
    out = x+y

    for i in range(length - 2): # Two characters are already in place
        out += select_next_letter(out[i], out[i+1])
    return out

In [168]:
generate_text(1000) # example

'v, qa este logitos bug ejecion erdo de pectificas conto, marintado dientestcon del mo conforealiza obasigos pri enta quitub qa con dacios entados ge  reo as del docue reunimpla de dedocu . al pructos y va de reacion con daily el erad austacion. vas pdfcon el se inientiona hu cion mentocentranclase sion lo ind seriporeguir no se pos tar nien redienos de ma hisios tervicioemen de se ento al proceportes, mhces y se dadocio se la avicio ve el  hu res tegista daily va de de ambien dadoces trolucion de prue reviacio y ca ficel con de de proyxxtoarma prucionmulsahu . senzar la y a los, produd ya lizace a hu,prodiguebugs anueblextruncion mor del de dela tervia con reamentia bollos.pruentade ca micregravicargregara par de guipoyechel por bug pruebas co hu tiendormunu  reuna campliel accio reopu   camizactarra lizan res plsquebasaracion geste u .nes cel de del pargo exto de hustercasocion dudias rectudiseresobe lo, lompo hu gente la ensalinfor proyo tia detantorte gendocese de sen bugsimentarad

The generated texts have no meaning and little local cohesion. However, seen as a whole they show a structure that to some extent respects the spacing between words, articles, connectors and some punctuation marks, and the model occasionally produces complete words or short phrases.