#### > 24/04/2020
### > Gozde Orhan

# Word classifier - Genuine or Gibberish? (ENG)     

This project is done in order to propose a solution to distinguish junk words from genuine words. The algorithm works based on **4 rules** and utilizes a .csv file as a genuine English word dictionary. The algorithm has an average accuracy rate of **84.68%**, which is significantly high.

### Rules ###

- If a word contains non-alpha character after leading and trailing characters are removed, it is junk.
- If a word contains *4 or more* consecutive vowels or consonants, it is junk.
- If the median of probabilities of two consecutive letters being consecutive is *greater* than the thereshold, the word is genuine.
- If the median of probabilities of two consecutive letters being consecutive is *less* than the thereshold **and** the median of euclidean distances of consecutive letters are less than the thereshold, the word is junk.

## Required packages

In [1]:
#Import required packages
import numpy as np
import pandas as pd
from math import sqrt #imported to take square root of a number
import statistics #imported to find the median values
import re #regular experession, imported to find if a string has whitespace

## Initializations

In [2]:
#List vowels to be used for vowel-consonant rule
vowels = ["a","e","i","o","u"]

#Define keyboard layout as a matrix to be used for distance rule
keyboard = [["q","w","e","r","t","y","u","i","o","p"],
           ["a","s","d","f","g","h","j","k","l"," "],
           [" ","z","x","c","v","b","n","m"," "," "]]

#Initialize probability matrix
probTable = np.zeros(shape=(26,26)) 

#Create a alphabet dictionary
dict_alph = {"a":0,"b":1,"c":2,"d":3,"e":4,"f":5,
            "g":6,"h":7,"i":8,"j":9,"k":10,"l":11,
            "m":12,"n":13,"o":14,"p":15,"q": 16,"r":17,
            "s":18,"t":19,"u":20,"v":21,"w":22,"x":23,
            "y":24,"z":25}

#Create a reversed version of the above dictionary
rev_dict = {0:"a",1:"b",2:"c",3:"d",4:"e",5:"f",
           6:"g",7:"h",8:"i",9:"j",10:"k",11:"l",
           12:"m",13:"n",14:"o",15:"p",16:"q",17:"r",
           18:"s",19:"t",20:"u",21:"v",22:"w", 23:"x",
           24:"y", 25:"z"}

## Pre-processing

In [3]:
#A function to read and process genuine word data
def ENGdataRead():
    
    initial_dataset = pd.read_csv('Dictionary_English.csv', keep_default_na=False)
    
    #Drop row if string contains whitespace
    for i in range(len(initial_dataset)):
        
        st = initial_dataset.at[i,'Words']
        
        if bool(re.search(r"\s", st)) == True:
            initial_dataset = initial_dataset.drop([i]) 
            
    #Shuffle data to avoid bias may occur due to ordering of words
    initial_dataset = initial_dataset.sample(frac=1)
    #Reset index of dataset
    initial_dataset = initial_dataset.reset_index(drop=True)
    
    #Designate 80% of genuine word data as 'training set'
    index = round((len(initial_dataset))*0.8)
    dataset = initial_dataset.iloc[0:index]
    
    #Designate 20% of genuine word data as 'test set'
    test = initial_dataset.iloc[index:len(initial_dataset)+1]
    
    #Get words in an array to be used in other functions
    testData = np.asarray(test["Words"])
    listData = np.asarray(initial_dataset["Words"])

    return listData, testData

## Rules

In [4]:
#A function to check whether a string includes 4 or more consecutive vowels or consonants
def checkVowelsCons(word):
    
    countVowels = 0
    countCons = 0
    stat = True
    
    for i in range (len(word)):
        
        if(word[i] not in vowels): #count consecutive consonants
            countVowels=0
            countCons+=1
            
            if(countCons==4):
                stat= False
                
        else:
            countVowels+=1 #count consecutive vowels
            countCons = 0
            
            if(countVowels==4):
                stat=False
    return stat

In [5]:
#A function to create a probability matrix 23x23 (alphabetxalphabet) and fill
#with the observed probabilities of two letters coming consecutively, through the help of .csv language data

def updateProbTable(dataset): #.csv lang data as an input
    
    #Create 26x26 matrix
    for i in range(26):
        
        main = rev_dict[i]
        temp = np.zeros(shape=(26))
        
        #Update occurences
        for word in dataset:
            for j in range(len(word)-1): 
                if(word[j] == main):
                    temp[dict_alph[word[j+1]]]+=1
        
        #A variable for a single letter, sum of observations
        sumRepeat = sum(temp)
        
        if(sumRepeat!=0):
            for k in range(len(temp)):
                temp[k]=temp[k]/sumRepeat #observed probability
                
        probTable[i]=temp

#A function to keep a probability array belongs to a word, finds the median value
def checkProbs(letter):
    
    k = []
    
    for i in range (len(letter)-1): 
        
        temp = probTable[dict_alph[letter[i]], dict_alph[letter[i+1]]] #get prob values from matrix
        k.append(temp) #append array
        
    med_stat = statistics.median(k)
            
    return med_stat

In [6]:
#A function to get letters position on keyboard as (x,y)
def findPos(letter):
    
    posX=- 1 
    posY=-1
    
    for i in range(3):
        for j in range(10):
            if(letter == keyboard[i][j]): #use initialized matrix keyboard layout
                
                posX = j
                posY = i
                
    return posX,posY

#A function to calculate euclidean distance between two consecutive letters in a word
def findDistance(l1,l2):
    
    x1, y1 = findPos(l1) #call findPos function
    x2, y2 = findPos(l2) #call findPos function
    dist = sqrt((x1-x2)**2 + (y1-y2)**2) #euclidean distance formula
    
    return dist

#A function to create an array consists of distance values and find the median value
def findMedDist(word):
    
    #initialize an empty array
    k = []
    
    for i in range (len(word)-1):  
        
        temp = findDistance(word[i],word[i+1]) #call findDistance function
        k.append(temp) #append the array to keep distance values in an array 
        
    med_stat_dist = statistics.median(k) #find median
    
    return med_stat_dist

## The set of rules - i.e. the Algorithm!

In [7]:
#A function to execute all 4 rules to classify word as junk or genuine, takes disp as a boolean to print output
def checkWord(word, disp):
    
    #Initialize a boolean, indicating class, false indicates junk word
    result = False

    #Declare thresholds
    disThreshold = 2.5
    probThreshold = 0.08
    
    #Get rid of whitespace for user input - we already did it for the dataset! :)
    word = word.replace(" ", "")
    
    #Create translator object to remove non-alpha
    nonalpha = '''!()-[]{};:'"\,<>./?@£%€#$%^=+&*_~'''
    word = word.strip(nonalpha)
    
    output=""
    
    if word.isalpha()==False:
        output= output+ "Still contains non-alpha"+"\n"+"Word:Junk"+"\n"

    else:
        if(checkVowelsCons(word)): #check vowel-consonant rule
            output= output+ "checkVowelsCons True"+"\n"

            if(checkProbs(word)>probThreshold): #check probability rule
                output= output+ "checkProbs True"+"\n"+"Word:Genuine"+"\n"
                result = True
            else:
                output= output+ "checkProbs False"+"\n"

                if(findMedDist(word)>disThreshold): #check distance rule
                    output= output+ "findMedDist True"+"\n"+"Word: Genuine"+"\n"
                    result = True
                else:
                    output=output+"findMedDist False"+"\n"+" Word:Junk"
        else:
            output=output+"checkVowelsCons False"+"\n"+"Word:Junk"
    if disp:
        print(output)
    return result

## Testing - Evaluation

- Algorithm is demonstrating a significant success since it uses an exhaustive set of rules. 
- Algorithm utilizes a small dictionary which jeopardizes the performance of the algorithm. It could be improved.
- Thereshold declaration is a challenging task.
- Short words are not much informative and thus hard to classify, rules may be improved to attain greater success in classification.

In [8]:
#A function to validate the algorithm with the test data which we know only includes genuine words
def evaluation(testData):
    
    count_gen = 0
    count_junk = 0
    
    for i in testData:
        result = checkWord(i, False)
        if result == False: #if algorithm classifies word as junk (i.e returns false)
            count_junk+=1 #count words classified as junk
        else:
            count_gen+=1 #count words classified as genuine
        
    accuracy = (count_gen/len(testData))*100 #calculate accuracy
    print('Accuracy on test set:', accuracy)

In [9]:
#Read lang data
listData, testData = ENGdataRead()

#Create probability matrix - learning phase
updateProbTable(listData)

#Run algorithm on the testData which algorithm never encountered before
#Prints accuracy - check bottom!
evaluation(testData)

Accuracy on test set: 85.66812831756289


## Now it's your turn to test! Enjoy! :)

In [10]:
#Interactive block of code enables users to enter their own words for algorithm to classify :)

#Read lang data
listData, testData = ENGdataRead()

#Create probability matrix - learning phase
updateProbTable(listData)

#Allow users to enter 10 words, run code again if further testing required

count = 0
valid = True

while valid==True:
        word = input('Enter word to predict:\n') #take input
        checkWord(word, True) #call checkWord function
        print(' ')
        count+=1
        if count==10:
            valid = False

Enter word to predict:
@@hello)
checkVowelsCons True
checkProbs True
Word:Genuine

 
Enter word to predict:
brown
checkVowelsCons True
checkProbs False
findMedDist True
Word: Genuine

 
Enter word to predict:
backpack
checkVowelsCons True
checkProbs True
Word:Genuine

 
Enter word to predict:
ajdaojsd
checkVowelsCons True
checkProbs False
findMedDist True
Word: Genuine

 
Enter word to predict:
alkssdsd
checkVowelsCons False
Word:Junk
 
Enter word to predict:
desktop
checkVowelsCons True
checkProbs False
findMedDist False
 Word:Junk
 
Enter word to predict:
warehouse
checkVowelsCons True
checkProbs True
Word:Genuine

 
Enter word to predict:
ocean
checkVowelsCons True
checkProbs False
findMedDist True
Word: Genuine

 
Enter word to predict:
lkujgi
checkVowelsCons True
checkProbs False
findMedDist False
 Word:Junk
 
Enter word to predict:
blue
checkVowelsCons True
checkProbs False
findMedDist True
Word: Genuine

 
