# Font recognition - base models

The code in this file fits the base models to the data for font recognition. This file will be submitted for grading as part of the project.

## Load data and train-validation split

**Data is loaded, inspected, and prepared**

In [2]:
import numpy as np
import pandas as pd

In [4]:
train_data = pd.read_csv('data/train_data.csv')
train_labels = pd.read_csv('data/train_labels.csv')

In [12]:
train_labels['Font'].unique()

array(['ARIAL', 'TIMES', 'SERIF', 'CAMBRIA', 'CALIBRI', 'TAHOMA'],
      dtype=object)

Neither the features nor the labels contain missing values:

In [45]:
train_data.shape[0] == train_data.dropna().shape[0]

True

In [46]:
train_labels.shape[0] == train_labels.dropna().shape[0]

True

Labels are encoded as integers with `pd.factorize`:

In [190]:
label_encoded, unique_labels = pd.factorize(train_labels['Font'])

In [48]:
# unique_labels holds the index used to decode labels:
# unique_labels[code] returns the font name for an encoded label
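
As a minimal sketch of how the decoding works, the toy series below (illustrative values, not the project data) is factorized and then mapped back through the returned index. The names `fonts`, `codes`, `uniques` and `decoded` are placeholders introduced here.

```python
import pandas as pd

# Toy stand-in for train_labels['Font']; the values are illustrative only
fonts = pd.Series(['ARIAL', 'TIMES', 'ARIAL', 'SERIF', 'TIMES'])

# codes: integer label per row, numbered in order of first appearance
# uniques: Index of the distinct labels, positioned by their code
codes, uniques = pd.factorize(fonts)

# Indexing uniques with the codes recovers the original strings
decoded = uniques[codes]

print(codes)          # integer codes
print(list(decoded))  # original labels restored
```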

A full DataFrame is then built by appending the encoded labels as the last column

In [49]:
labels = pd.DataFrame(label_encoded, columns=['label'])

In [52]:
df = pd.concat([train_data, labels], axis = 1)

**The data is split into training and validation sets**

In [75]:
from sklearn.model_selection import train_test_split

In [76]:
X = df.iloc[:, :-1]
Y = df.iloc[:, -1]

In [78]:
x_train_df, x_valid_df, y_train_df, y_valid_df = train_test_split(X, Y, test_size=0.3, random_state = 0)

In [79]:
X.shape

(65000, 407)
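
With 65,000 rows and `test_size=0.3`, the split should leave 45,500 rows for training and 19,500 for validation. The sketch below checks that arithmetic on a zero-filled placeholder array of the same length; `X_demo` and `y_demo` are hypothetical stand-ins, not the font data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as X (contents are irrelevant here)
n_rows = 65000
X_demo = np.zeros((n_rows, 1))
y_demo = np.zeros(n_rows, dtype=int)

# Same split parameters as in the cell above
x_tr, x_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# (training rows, validation rows)
split_sizes = (x_tr.shape[0], x_va.shape[0])
print(split_sizes)
```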

**Finally, the test data is loaded as well**

In [80]:
test_data = pd.read_csv('data/test_data.csv')

In [81]:
test_data.shape

(29221, 407)

In [202]:
x_test_df = test_data

## Normalization of data

The split DataFrames now hold all the needed information. They are converted to NumPy arrays for easier handling with scikit-learn

In [147]:
x_train_pre_norm = np.array(x_train_df)
x_valid_pre_norm = np.array(x_valid_df)
y_train = np.array(y_train_df)
y_valid = np.array(y_valid_df)

In [203]:
x_test_pre_norm = np.array(x_test_df)

In [148]:
X_np = np.array(X)

`mean` and `std` are computed per feature from the full labeled dataset `X` (training and validation rows together)

In [156]:
mean = np.mean(X_np, axis = 0)
std = np.std(X_np, axis = 0)
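
Computing the statistics from the full dataset means the validation rows influence the scaling. A common alternative, shown here only as a sketch and not what this notebook does, is to estimate them from the training split alone with scikit-learn's `StandardScaler`. The arrays below are synthetic placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic placeholder splits (not the font data)
rng = np.random.default_rng(0)
x_train_demo = rng.normal(5.0, 2.0, size=(200, 3))
x_valid_demo = rng.normal(5.0, 2.0, size=(100, 3))

# Estimate per-column mean and std on the training split only
scaler = StandardScaler().fit(x_train_demo)

# Apply the same training statistics to both splits
x_train_scaled = scaler.transform(x_train_demo)
x_valid_scaled = scaler.transform(x_valid_demo)

print(x_train_scaled.mean(axis=0))  # close to 0 by construction
print(x_valid_scaled.mean(axis=0))  # close to 0, but not exact
```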

The normalization function from Homework 9 is reused

In [157]:
def normalize(X, mean, std):
    """Normalize the columns of X using the given
    per-column mean and std"""
    # Broadcasting applies mean and std column-wise across all rows
    return (X - mean) / std
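
One case the function above does not handle is a constant column, where `std` is 0 and the division produces `nan` or `inf`. Whether the font features contain such columns is not checked here; the hypothetical `normalize_safe` below is one possible guard, replacing zero standard deviations with 1 so that constant columns map to 0.

```python
import numpy as np

def normalize_safe(X, mean, std):
    """Column-wise normalization that tolerates zero-variance columns.

    Hypothetical helper, not part of the original notebook.
    """
    # Replace zero std with 1 so constant columns become 0 instead of nan
    safe_std = np.where(std == 0, 1.0, std)
    return (X - mean) / safe_std

# Toy data: the second column is constant
X_demo = np.array([[1.0, 5.0], [3.0, 5.0], [5.0, 5.0]])
mean_demo = X_demo.mean(axis=0)
std_demo = X_demo.std(axis=0)

X_norm = normalize_safe(X_demo, mean_demo, std_demo)
print(X_norm)
```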

In [158]:
x_train = normalize(x_train_pre_norm, mean, std)
x_valid = normalize(x_valid_pre_norm, mean, std)

In [204]:
x_test = normalize(x_test_pre_norm, mean, std)

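Check that normalization worked. Because `mean` and `std` come from the full dataset, each subset should have per-feature values close to, but not exactly:
$$\text{mean} \approx 0$$ $$\text{std} \approx 1$$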

In [167]:
np.mean(x_train, axis = 0)[0:4]

array([ 0.00081244, -0.00083289, -0.00177084, -0.00353234])

In [168]:
np.std(x_valid, axis = 0)[0:4]

array([1.0004818 , 1.00024253, 1.00063935, 0.98696497])

In [206]:
np.mean(x_test, axis = 0)[0:4]

array([ 0.00514263, -0.00069739,  0.00495033, -0.00141616])

In [205]:
np.std(x_test, axis = 0)[0:4]

array([1.00905403, 0.99991203, 1.00076391, 1.00143199])
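
The small deviations from 0 and 1 above are expected: the statistics were computed on all 65,000 labeled rows, while each check covers only a subset of them (or the separate test set). The sketch below reproduces the effect on synthetic data; all names are placeholders.

```python
import numpy as np

# Synthetic placeholder data (not the font features)
rng = np.random.default_rng(0)
full = rng.normal(loc=10.0, scale=3.0, size=(1000, 4))

# Statistics from the full array, as in the notebook
mean_full = full.mean(axis=0)
std_full = full.std(axis=0)
full_norm = (full - mean_full) / std_full

# A 70% subset normalized with the full-array statistics
subset_norm = full_norm[:700]

full_mean = full_norm.mean(axis=0)      # zero up to floating-point error
subset_mean = subset_norm.mean(axis=0)  # small but nonzero

print(full_mean)
print(subset_mean)
```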

This gives two versions of the features: with and without normalization. The normalized version is used for the models below.

## Naive Bayes

## Logistic Regression

In [174]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import hamming_loss

In [170]:
model = LogisticRegression(multi_class='ovr', solver='liblinear')

In [172]:
model.fit(x_train, y_train)

LogisticRegression(multi_class='ovr', solver='liblinear')
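
Recent scikit-learn releases deprecate the `multi_class` argument of `LogisticRegression`. If that becomes an issue, one possible equivalent is to wrap the estimator in `OneVsRestClassifier`. This is a sketch on synthetic data from `make_classification`, not the configuration used for the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class problem standing in for the font data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

# One binary liblinear classifier is fitted per class
ovr_model = OneVsRestClassifier(LogisticRegression(solver='liblinear'))
ovr_model.fit(X_demo, y_demo)

y_pred_demo = ovr_model.predict(X_demo)
print(len(ovr_model.estimators_))  # number of binary classifiers
```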

In [197]:
y_pred_train = model.predict(x_train)
error = hamming_loss(y_train, y_pred_train)
print('The training error is: ' + str(error) + '.')

The training error is: 0.5043736263736264.


In [195]:
y_pred_valid = model.predict(x_valid)
error = hamming_loss(y_valid, y_pred_valid)
print('The validation error is: ' + str(error) + '.')

The validation error is: 0.5298974358974359.
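
For single-label multiclass targets like these, `hamming_loss` is the fraction of misclassified samples, i.e. 1 minus the accuracy. The values above therefore correspond to roughly 49.6% training accuracy and 47.0% validation accuracy. The toy arrays below are made up to illustrate the relationship.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

# Made-up labels: 2 of the 6 predictions are wrong
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 2, 2, 1, 1, 0])

error = hamming_loss(y_true, y_pred)        # fraction misclassified
accuracy = accuracy_score(y_true, y_pred)   # fraction correct

print(error, accuracy, error + accuracy)
```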


**Predictions on the test set are computed**

In [222]:
y_pred_test = model.predict(x_test)

In [223]:
pred_label = unique_labels[y_pred_test]

In [224]:
pred_label

Index(['ARIAL', 'SERIF', 'CALIBRI', 'ARIAL', 'ARIAL', 'CALIBRI', 'CALIBRI',
       'ARIAL', 'SERIF', 'ARIAL',
       ...
       'ARIAL', 'SERIF', 'TAHOMA', 'ARIAL', 'ARIAL', 'ARIAL', 'TIMES', 'ARIAL',
       'TAHOMA', 'ARIAL'],
      dtype='object', length=29221)

In [225]:
len(pred_label)

29221

In [226]:
ids = np.arange(1,len(pred_label)+1,1)

In [227]:
len(ids)

29221

In [232]:
data = {'ID':ids, 
        'Font':pred_label} 

In [233]:
submission = pd.DataFrame(data)

In [237]:
submission

| | ID | Font |
|---|---|---|
| 0 | 1 | ARIAL |
| 1 | 2 | SERIF |
| 2 | 3 | CALIBRI |
| 3 | 4 | ARIAL |
| 4 | 5 | ARIAL |
| ... | ... | ... |
| 29216 | 29217 | ARIAL |
| 29217 | 29218 | TIMES |
| 29218 | 29219 | ARIAL |
| 29219 | 29220 | TAHOMA |


In [239]:
submission.to_csv("test_submission.csv", index = False)
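
The submission steps above can also be collapsed into a single DataFrame construction. The sketch below does so on a handful of made-up predictions and writes to an in-memory buffer instead of a file, so the resulting CSV layout can be inspected; `unique_labels_demo` and `y_pred_demo` are hypothetical.

```python
import io

import numpy as np
import pandas as pd

# Made-up decoding index and predicted codes (not the real model output)
unique_labels_demo = pd.Index(['ARIAL', 'TIMES', 'SERIF'])
y_pred_demo = np.array([0, 2, 1, 0])

# IDs start at 1; fonts are decoded by indexing with the predicted codes
submission_demo = pd.DataFrame({
    'ID': np.arange(1, len(y_pred_demo) + 1),
    'Font': unique_labels_demo[y_pred_demo],
})

# Write to a string buffer rather than disk to inspect the CSV text
buffer = io.StringIO()
submission_demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```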