# Converting Roman -> Arabic numerals through ML (and why this is a bad idea)

This is a [Toy Problem](https://en.wikipedia.org/wiki/Toy_problem) and also a bad idea.

**Why is this a bad idea?**

Machine learning is not the appropriate tool for this challenge. To see why, we need to understand what we're trying to predict. The problem is a 1:1 mapping: every number has exactly one Roman numeral and exactly one Arabic numeral identifying it. How can a model identify patterns in the data when there is nothing to *actually* generalize?

This is an [Overfitting](https://en.wikipedia.org/wiki/Overfitting) issue, as we're just trying to teach the exact values to the model. If each number is treated as its own class, every class appears only once in the data, so any example held out for testing belongs to a class the model never saw during training.

Also note that the problem is not the number of classes itself; after all, classifying objects from pictures is doable and is also a multi-class problem.

**Nevertheless**, the following experiment showcases another limitation, this time in terms of feature engineering: several Roman numerals share the same letters but differ in their order. As a result, taking position into account is essential to this problem.
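For contrast, here is a minimal sketch of the deterministic conversion, which is not part of the original experiment. The names `ROMAN_VALUES` and `roman_to_arabic` are hypothetical, and the function assumes well-formed numerals in the standard subtractive notation. It shows that the mapping follows a fixed positional rule, so there is nothing to learn from data.

```python
# Hypothetical helper for illustration only; assumes well-formed input.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_arabic(roman: str) -> int:
    """Convert a Roman numeral to an integer using the subtractive rule."""
    total = 0
    for i, letter in enumerate(roman):
        value = ROMAN_VALUES[letter]
        # A smaller symbol placed before a larger one is subtracted (the X in XL).
        if i + 1 < len(roman) and value < ROMAN_VALUES[roman[i + 1]]:
            total -= value
        else:
            total += value
    return total

# Same letters, different order, different values.
print(roman_to_arabic("LXI"), roman_to_arabic("XLI"))  # 61 41
```

The only thing that separates `LXI` from `XLI` is the order of the letters, which is exactly the information a position-blind feature set throws away.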

In [1]:
# Problem space: Trying to predict arabic numerals from roman numerals. 
# We'll use data that spans from 1 to 1000.

In [2]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import tree

In [3]:
# Importing and inspecting data
data = pd.read_csv("data/1-1000.csv")
print(data.dtypes)

arabic     int64
roman     object
dtype: object


In [4]:
# Cleaning
data['roman'] = data['roman'].str.strip()
print(data.sort_values(by=['arabic']).tail())
print(data.dtypes)

     arabic     roman
996     996    CMXCVI
997     997   CMXCVII
998     998  CMXCVIII
999     999    CMXCIX
11     1000         M
arabic     int64
roman     object
dtype: object


## Approach 1: Counting letters
We'll use the number of times each letter appears as our key feature, regardless of the order in which the letters appear.

In [5]:
# Semi-One hot encoding – Making the classifier understand letters in a friendlier way

In [6]:
dataset = data.copy()

romans = data['roman'].unique()
letters_list = []
for r in romans:
    letters = list(r)
    for l in letters:
        letters_list.append(l)
letters_list = list(set(letters_list))
print(letters_list)

for l in letters_list:
    dataset[l] = np.nan
    
print(dataset)

['V', 'C', 'L', 'D', 'M', 'I', 'X']
     arabic     roman   V   C   L   D   M   I   X
0         1         I NaN NaN NaN NaN NaN NaN NaN
1         2        II NaN NaN NaN NaN NaN NaN NaN
2         3       III NaN NaN NaN NaN NaN NaN NaN
3         4        IV NaN NaN NaN NaN NaN NaN NaN
4         5         V NaN NaN NaN NaN NaN NaN NaN
..      ...       ...  ..  ..  ..  ..  ..  ..  ..
995     995     CMXCV NaN NaN NaN NaN NaN NaN NaN
996     996    CMXCVI NaN NaN NaN NaN NaN NaN NaN
997     997   CMXCVII NaN NaN NaN NaN NaN NaN NaN
998     998  CMXCVIII NaN NaN NaN NaN NaN NaN NaN
999     999    CMXCIX NaN NaN NaN NaN NaN NaN NaN

[1000 rows x 9 columns]


In [7]:
# Function to count letters
def count_letters(roman, letter):
    # Number of times `letter` occurs in the numeral `roman`
    return roman.count(letter)
#count_letters("CMXCV", "V")

#Applying function to dataset
for l in letters_list:
    dataset[l] = dataset["roman"].apply(count_letters,letter=l)
print(dataset)

     arabic     roman  V  C  L  D  M  I  X
0         1         I  0  0  0  0  0  1  0
1         2        II  0  0  0  0  0  2  0
2         3       III  0  0  0  0  0  3  0
3         4        IV  1  0  0  0  0  1  0
4         5         V  1  0  0  0  0  0  0
..      ...       ... .. .. .. .. .. .. ..
995     995     CMXCV  1  2  0  0  1  0  1
996     996    CMXCVI  1  2  0  0  1  1  1
997     997   CMXCVII  1  2  0  0  1  2  1
998     998  CMXCVIII  1  2  0  0  1  3  1
999     999    CMXCIX  0  2  0  0  1  1  2

[1000 rows x 9 columns]


In [8]:
# Preparing dataset for ML
romans_target = dataset["arabic"].values
romans_data = dataset.iloc[:,2:].values
feature_names = list(dataset.iloc[:,2:].columns)

In [9]:
# Splitting datasets, training decision tree, and predicting values
X_train, X_test, Y_train, Y_test = train_test_split(romans_data, romans_target, test_size = 0.2)
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)

In [10]:
#Printing some of the results
result = []
for i in range(0,15):
    value = Y_test[i]
    dic = {'actual_roman':data[data['arabic']==value].iloc[0,1],
           'actual':Y_test[i], 
           'pred': Y_pred[i],
           'pred_roman': data[data['arabic']==Y_pred[i]].iloc[0,1]
          }
    result.append(dic)
pd.DataFrame(result)                                           

  actual_roman  actual  pred  pred_roman
0    CCCLXXXII     382   379    CCCLXXIX
1     DCCCLIII     853   753     DCCLIII
2    DCCCXXXII     832   833  DCCCXXXIII
3          CMV     905   915        CMXV
4        XCVII      97   117       CXVII
5       CDXCIX     499   699      DCXCIX
6      DCCXXII     722   822    DCCCXXII
7         DVII     507   504         DIV
8          LXI      61    41         XLI
9        DXLIX     549   569       DLXIX


Besides the problem of every class being unique, this specific approach does not consider the position of each letter within the Roman numeral. As can be seen above, some numbers are written with exactly the same letters, such as 61 (LXI) and 41 (XLI). Since the model's features do not take position into account, such numbers are indistinguishable to the model, so this is a clearly flawed approach.
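To get a feel for how often letter counts collide, the sketch below groups the numbers 1 to 1000 by the multiset of letters in their Roman numeral. It does not read the notebook's CSV: the helper `to_roman` and the `groups` dictionary are hypothetical names, and the numerals are generated in standard subtractive notation, which is assumed to match the dataset.

```python
from collections import defaultdict

# Value/symbol pairs in descending order, including the subtractive forms.
PAIRS = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
         (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
         (5, "V"), (4, "IV"), (1, "I")]

def to_roman(n: int) -> str:
    """Hypothetical helper: standard Roman numeral for 1 <= n <= 3999."""
    parts = []
    for value, symbol in PAIRS:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)

# Sorting the letters discards their order, just like the count features do.
groups = defaultdict(list)
for n in range(1, 1001):
    signature = "".join(sorted(to_roman(n)))
    groups[signature].append(n)

collisions = {sig: nums for sig, nums in groups.items() if len(nums) > 1}
print(groups["".join(sorted("LXI"))])  # [41, 59, 61]
print(len(collisions), "letter-count signatures are shared by several numbers")
```

Every number in a shared group has an identical feature vector under the counting approach, so no classifier trained on those features could tell them apart, however much data it saw.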

As mentioned at the beginning of the notebook, this is clearly a bad idea. Nonetheless, this experiment illustrates some interesting ML concepts, such as overfitting and how crucial feature engineering is.
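The point about unique classes can be checked without training anything. The sketch below mimics an 80/20 split like the one used in the notebook, but it uses only the standard library and a fixed seed rather than `train_test_split`; the variable names are hypothetical.

```python
import random

# One label per number, as in the 1-1000 dataset: every class occurs exactly once.
labels = list(range(1, 1001))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rng.shuffle(labels)

split = int(len(labels) * 0.8)
train_labels = set(labels[:split])
test_labels = set(labels[split:])

# Labels that appear in the test set but were never seen during training.
unseen = test_labels - train_labels
print(len(unseen) == len(test_labels))  # True
```

A classifier can only output classes it saw during training, so with this kind of split an exact match on a held-out number is impossible by construction, whatever features are used.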