In [1]:
import pandas as pd

In [2]:
dataSetRaw = pd.read_excel('DataSet.xlsx')
dataSetRaw.head(10)

Unnamed: 0,age,income,student,cradit_rating,buy_computer
0,69,7.0,yes,excelent,yes
1,65,11.0,no,excelent,no
2,41,5.0,no,excelent,yes
3,65,5.0,yes,excelent,yes
4,23,14.0,no,fair,yes
5,43,6.0,yes,excelent,yes
6,27,15.0,no,excelent,yes
7,41,12.0,yes,fair,yes
8,31,6.0,no,fair,no
9,33,13.0,yes,excelent,yes


In [3]:
# https://towardsdatascience.com/3-ultimate-ways-to-deal-with-missing-values-in-python-ac5a17c53787#:~:text=You%20can%20use%20pandas%20DataFrame,contain%20atleast%20one%20missing%20value.
dataSetRaw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            1000 non-null   int64  
 1   income         994 non-null    float64
 2   student        1000 non-null   object 
 3   cradit_rating  1000 non-null   object 
 4   buy_computer   1000 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 39.2+ KB


# 1. finding missing value in data set
code <code>dataSet.info()</code> will give us a quick view of our data set <br>
using 
<code>dataSet.isnull().sum()</code> will help us to find out how much of our data is missing, which will help us to decide on a method to deal with the missings. using either mean value, dropping, putting zero and other methods are slightly depends on number of missing values. 

In [4]:
dataSetRaw.isnull().sum()

age              0
income           6
student          0
cradit_rating    0
buy_computer     0
dtype: int64

In [5]:
# total number of data 
len(dataSetRaw)

1000

<b>answer:</b>
as we can see, only 6 values among 1000 values are missed. so <b>dropping</b> aproach would not change data in significant. In general, it would be better not to drop nulls and use <b>mean value</b> instead, however in this case, there are some error values to be handled first.

In [6]:
dataSet_dropped = dataSetRaw.dropna(axis = 0, how ='any')
dataSet_dropped.isnull().sum()

age              0
income           0
student          0
cradit_rating    0
buy_computer     0
dtype: int64

In [7]:
dataSet_dropped.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 994 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            994 non-null    int64  
 1   income         994 non-null    float64
 2   student        994 non-null    object 
 3   cradit_rating  994 non-null    object 
 4   buy_computer   994 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 46.6+ KB


In [8]:
nullIndices = dataSetRaw[dataSetRaw.iloc[:,1].isnull()].index
dataSetRaw.iloc[nullIndices]

Unnamed: 0,age,income,student,cradit_rating,buy_computer
10,33,,yes,fair,yes
108,50,,no,fair,yes
226,33,,no,excelent,no
443,53,,yes,excelent,yes
786,44,,yes,fair,yes
988,61,,yes,fair,yes


In [9]:
incomeMean = dataSetRaw["income"].mean()
incomeMean

9.922535211267606

In code above I count mean values while there is some error values in the data set which can make a slight difference in result. In part 2 I will have mean value after I manage error values (negative incomes).

In [10]:
dataSetFillMean = dataSetRaw
dataSetFillMean.iloc[nullIndices, 1] = incomeMean
dataSetFillMean.iloc[nullIndices]

Unnamed: 0,age,income,student,cradit_rating,buy_computer
10,33,9.922535,yes,fair,yes
108,50,9.922535,no,fair,yes
226,33,9.922535,no,excelent,no
443,53,9.922535,yes,excelent,yes
786,44,9.922535,yes,fair,yes
988,61,9.922535,yes,fair,yes


# 2- finding values with error in income column

In [11]:
# using this query on dataframe will help to find odds.
dataSetRaw[dataSetRaw['income'] <= 0]

Unnamed: 0,age,income,student,cradit_rating,buy_computer
20,40,-13.0,no,excelent,no
820,43,-15.0,no,fair,no


<i>You can replace all values or selected values in a column of pandas DataFrame based on condition by using <code>DataFrame.loc[]</code>, <code>np.where()</code> and <code>DataFrame.mask()</code> methods.<l> [sparksby{examples}](https://sparkbyexamples.com/pandas/pandas-replace-values-based-on-condition/#:~:text=You%20can%20replace%20values%20of,the%20values%20of%20pandas%20DataFrame.)


In [12]:
# to find incomes mean value of income column while error values are not been counted
incomeMean = dataSetRaw[dataSetRaw['income'] >= 0].mean(axis=0, numeric_only=True)[1]

In [13]:
dataSet = dataSetRaw
dataSet.iloc[nullIndices, 1] = incomeMean
dataSet.loc[dataSet['income'] <= 0, 'income'] = incomeMean

In [14]:
dataSet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   age            1000 non-null   int64  
 1   income         1000 non-null   float64
 2   student        1000 non-null   object 
 3   cradit_rating  1000 non-null   object 
 4   buy_computer   1000 non-null   object 
dtypes: float64(1), int64(1), object(3)
memory usage: 39.2+ KB


<b>answer:</b>
as we had two errors in incomes, I replace them by means of all valid incomes. In cells above, you can see values before and after I applied to find and correct errors.

# 3. finding similarity between customers

<b>answer:</b> <br> <h3>Dissimilarity for attributes of mixed types</h3>
Han, Data Mining, Concepts and Techniques, Fourth Edition <br><br>
$$
 d_{(i, j)} = \frac{\sum\limits_{f=1}^{p}{\delta}_{ij}^{(f)}d_{ij}^{(f)}}{\sum\limits_{f=1}^{p}{\delta}_{ij}^{(f)}}\, 
$$
where the indicator ${\delta}_{ij}^{(f)} = 0$ if either (1) $x_{if}$ or $x_{jf}$ is missing (i.e., there is no measurement of attribute
$f$ for object $i$ or object $j$ ), or (2) $x_{if} = x_{jf} = 0$ and attribute $f$ is asymmetric binary; otherwise, ${\delta}_{ij}^{(f)} = 1$

As far as we have age (nymberic type), income (numeric type), student (asymmetric binary), credit_rate (ordinal), and buy_computer (asymmetric binary) attributes we have to use (1 - dissimilarity) to get similarity between to customers

# 4. similarity function implementation

First lets have a copy of the preprocessed data set (nulls and errors removed). Then I will use my own normalization functions on this copy to make the ready for numeric dissimilarity function and at last for the mixed types dissimilarity functions. 

In [15]:
dfCopy = dataSet
dfCopy.head(5)

Unnamed: 0,age,income,student,cradit_rating,buy_computer
0,69,7.0,yes,excelent,yes
1,65,11.0,no,excelent,no
2,41,5.0,no,excelent,yes
3,65,5.0,yes,excelent,yes
4,23,14.0,no,fair,yes


non-numeric columns have been derived from domains below. these domains will help in mapping these values to a proper one.

In [16]:
# in this state, I find the domain of each columns
studentAtr = dfCopy['student'].unique()
cradit_ratingAtr = dfCopy['cradit_rating'].unique()
buy_computerAtr = dfCopy['buy_computer'].unique()
studentAtr, cradit_ratingAtr, buy_computerAtr

(array(['yes', 'no'], dtype=object),
 array(['excelent', 'fair'], dtype=object),
 array(['yes', 'no'], dtype=object))

using a dictionary to map 

In [17]:
# in order to run Dissimilarity or similarity functions I have to normalize the values\
# as suitable inputs to the algorithms - by using map method and creating dictionaries of reletive values.
student = {'yes': True, 'no': False}
credit_rate = {'excelent': True, 'fair': False}
buy_computer = {'yes': True, 'no': False}
dfCopy['student'] = dfCopy['student'].map(student)
dfCopy['cradit_rating'] = dfCopy['cradit_rating'].map(credit_rate)
dfCopy['buy_computer'] = dfCopy['buy_computer'].map(buy_computer)
dfCopy.head(5)

Unnamed: 0,age,income,student,cradit_rating,buy_computer
0,69,7.0,True,True,True
1,65,11.0,False,True,False
2,41,5.0,False,True,True
3,65,5.0,True,True,True
4,23,14.0,False,False,True


In [18]:
def zScore(x=0, mean=0, std=1):
    return (x - mean) / std

def minMax(x=0, minVal=0, maxVal=1):
    return (x - minVal) / (maxVal - minVal)

# def createMatrix(size=(2, 2)):
#     from numpy import zeros
#     return zeros(shape=size, dtype=int)

By using functions above, numeric types (columns age and income) will be updated

In [19]:
dataSetMean = dfCopy.mean(numeric_only=True)
dataSetStd = dfCopy.std(numeric_only=True)
minVal = dfCopy.min(numeric_only=True)
maxVal = dfCopy.max(numeric_only=True)

incomeMean = dataSetMean[1]
incomeStd = dataSetStd[1]

ageMean = dataSetMean[0]
ageStd = dataSetStd[0]

income = []
age = []

for index, item in enumerate(dfCopy['income']):
    income.append(zScore(x=item, mean=incomeMean, std= incomeStd))
    
for index, item in enumerate(dfCopy['age']):
    age.append(zScore(x=item, mean=ageMean, std= ageStd))
dfCopy['income'] = income
dfCopy['age'] = age
dfCopy.head(5)

Unnamed: 0,age,income,student,cradit_rating,buy_computer
0,1.637402,-0.942644,True,True,True
1,1.358553,0.326584,False,True,False
2,-0.314542,-1.577259,False,True,True
3,1.358553,-1.577259,True,True,True
4,-1.569363,1.278505,False,False,True


in ordet to measure similarity between 2 customer, involving buy_computer column to the formula is wrong. because obviously our goal is to predict whether a new customer buy a computer or not. a simple way is to seprate X, and y at first. but What I did is just to leave last column. 

In [20]:
X = dfCopy.iloc[:,:-1]
y = dfCopy.iloc[:,-1]

functions below will be used in order to calculate dissimilarity in asymmetric binary types and numeric types.
as the question 4 specified the input parameters and the output parameter, there are 6 input parameters in <code> binaryDissimilarity(student1=False, student2=False, credit1=False, credit2=False):</code><br>
In <code>numericDissimilarity(i=[], j=[], euclideanDistance=True):</code> function to the contrary, the inputs are a list of attributes.<br>
<code>mixedTypesDissimilarity(i=[], j=[])</code> function takes copied dataframe's 2 records as inputs and give a dissimilarity figure as an output. As we know (1 - dissimilarity) is equal to similarity.

In [21]:
# def binaryDissimilarity(student1=False, student2=False, credit1=False, 
#                         credit2=False, buy1=False, buy2=False):
#     from numpy import array
#     dataMatrix = array(([student1, credit1, buy1],
#                         [student2, credit2, buy2]))
def binaryDissimilarity(student1=False, student2=False, credit1=False, 
                        credit2=False):
    from numpy import array
    dataMatrix = array(([student1, credit1],
                        [student2, credit2]))
    m_01 = 0
    m_10 = 0
    m_11 = 0
    M_00 = 0
    for i in range(dataMatrix.shape[1]):
        if dataMatrix[:,i][0] == True and dataMatrix[:,i][1] == True:
            m_11 += 1
        elif dataMatrix[:,i][0] == False and dataMatrix[:,i][1] == True:
            m_01 += 1
        elif dataMatrix[:,i][0] == True and dataMatrix[:,i][1] == False:
            m_10 += 1
        elif dataMatrix[:,i][0] == False and dataMatrix[:,i][1] == False:
            M_00 += 1
    return (m_01 + m_10) / (m_01 + m_10 + m_11)

def numericDissimilarity(i=[], j=[], euclideanDistance=True):
    if len(i) == 0 and len(j) == 0:
        return
    if euclideanDistance:
        from math import sqrt
        from numpy import array
        dataMatrix = array((i, j))
        result = 0
        for i in range(dataMatrix.shape[1]):
            result += (dataMatrix[:,i][0] - dataMatrix[:,i][1]) ** 2
        return sqrt(result)

def mixedTypesDissimilarity(i=[], j=[]):  
    if len(i) == 0 and len(j) == 0:
        return
    binary = []
    numberic = []    
    
    for col in range(len(i)):
        if type(i[col]) == bool and type(j[col]) == bool:
            binary.append((i[col], j[col]))
        else:
            numberic.append((i[col], j[col]))
            
    resultSum = binaryDissimilarity(student1=binary[0][0], student2=binary[0][1],
                       credit1=binary[1][0], credit2=binary[1][1])
    from numpy import column_stack
    i = column_stack(numberic)[0]
    j = column_stack(numberic)[1]
    resultSum += numericDissimilarity(i=i, j=j)
    return resultSum / (len(binary) + len(numberic))

def similarity(dissimilarity):
    return 1 - dissimilarity

In [22]:
i = [1.637402, -0.942644, True, True]
j = [1.358553, 0.326584, False, True]
d = mixedTypesDissimilarity(i=i, j =j)

In [23]:
similarity(d)

0.5501253625641692

# 5. buying computer prediction (k=1)

Function below gets dataframe X and a new record to predict if a new customer will buy a coumputer or not

In [24]:
def mostSimilarty(df, new):
    mostSimilar = [0, 0]
    # mostSimilar[0][0]
    for index, row in enumerate(df.values):
        d = mixedTypesDissimilarity(i=new, j=row)
        if mostSimilar[1] < similarity(dissimilarity=d):
            mostSimilar[1] = similarity(dissimilarity=d)
            mostSimilar[0] = index
    return mostSimilar
new = [1.9, -0.3490, False, True]
mostSimilar = mostSimilarty(df=dfCopy, new=new)
mostSimilar

[192, 0.9507028855479938]

In [25]:
def getRecord(mostSimilar):
    record = dfCopy.iloc[mostSimilar[0]]
    return record
record = getRecord(mostSimilar=mostSimilar)
record

age              1.707115
income           -0.30803
student             False
cradit_rating        True
buy_computer        False
Name: 192, dtype: object

In [26]:
def isBuy(record):
    return record[4]
isBuy(record)

False

# 6 section 5 (k=3)

In [27]:
def kMostSimilar(df, new, k=3):
    kMost = []
    copy = df.copy()
    for i in range(k):
        kmax = mostSimilarty(copy, new)
        kMost.append(kmax)
        copy.drop(kmax[0], inplace=True)
    return kMost
kMost = kMostSimilar(df=dfCopy, new=new, k=3)
kMost

[[192, 0.9507028855479938],
 [222, 0.9507028855479938],
 [221, 0.9507028855479938]]

In [28]:
def majorityVote(kMost, p=2):
    counter = 0
    size = len(kMost)
    for item in kMost:
        record = getRecord(item)
        print(item)
        print(record)
        if isBuy(record):
            counter += 1
    if ((counter / size) >= ((size // p) + 1) / size):
        return True
    return False
    
majorityVote(kMost=kMost) 

[192, 0.9507028855479938]
age              1.707115
income           -0.30803
student             False
cradit_rating        True
buy_computer        False
Name: 192, dtype: object
[222, 0.9507028855479938]
age              0.243156
income           0.643891
student             False
cradit_rating        True
buy_computer         True
Name: 222, dtype: object
[221, 0.9507028855479938]
age             -1.151089
income           1.595812
student             False
cradit_rating        True
buy_computer         True
Name: 221, dtype: object


True