
<h1>Pre-Processing Data</h1>
<p>Read data from the provided CSV file into a DataFrame. After a leading index column, the file has six feature columns: <code>[A, B, C, D, E, F]</code>.</p>
<p><code>A, B</code> are <b>categorical</b>.</p>
<p><code>C, D, E, F</code> are <b>numerical</b>.</p>


In [1]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from scipy import stats

data = pd.read_csv('https://raw.githubusercontent.com/michaelchapa'
                   '/dataMining_data_preprocessing/master/hwk01.csv')

# Column 0 is a redundant index column, so both slices skip it.
numericalFeatures = data[data.columns[3:]]  # C, D, E, F
nominalFeatures = data[data.columns[1:3]]   # A, B

<h2>Mean</h2>

In [None]:
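def bin_Means(data, depth):
    """Smooth attribute F by bin means, using `depth` equal-width bins."""
    data = data['F'].sort_values() # Series of F values in ascending order
    
    # pd.cut splits the value range into `depth` equal-width intervals,
    # labelled 1..depth; retbins = True also returns the bin edges.
    binValues, binEdges = pd.cut(data.array, bins = depth,
                                 labels = range(1, depth + 1), retbins = True)
        
    print("The respective bin for each value of attribute: \n", set(binValues), "\n")
    print("Computed bins: \n", binEdges, "\n")
    
    # Pair each value with its bin label, then average within each bin.
    binnedValues = pd.DataFrame(
                   list(zip(data, binValues)), columns = ['value', 'bin'])
    binnedValuesMean = binnedValues.groupby(['bin']).mean()
    print("Mean value for each value in bin: \n", binnedValuesMean, "\n\n")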
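<p>Note that <code>pd.cut</code> with an integer <code>bins</code> argument produces equal-width bins, whereas <code>pd.qcut</code> produces equal-depth (equal-frequency) bins. The sketch below contrasts the two when smoothing by bin means. The sample values and the helper name <code>smooth_by_bin_means</code> are illustrative assumptions and are not taken from <code>hwk01.csv</code> or from the notebook.</p>

```python
import pandas as pd

# Hypothetical sample values, not taken from hwk01.csv.
values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], name="value")

def smooth_by_bin_means(series, k, equal_depth=False):
    """Replace each value with the mean of the bin it falls into."""
    if equal_depth:
        # Roughly the same number of observations in each bin.
        bins = pd.qcut(series, q=k, labels=False)
    else:
        # The value range is split into k intervals of equal width.
        bins = pd.cut(series, bins=k, labels=False)
    return series.groupby(bins).transform("mean")

equal_width = smooth_by_bin_means(values, 3)
equal_depth = smooth_by_bin_means(values, 3, equal_depth=True)

print(pd.DataFrame({"value": values,
                    "equal_width": equal_width,
                    "equal_depth": equal_depth}))
```

<p>With equal-depth bins, every bin in this sample holds four values; with equal-width bins, the bin sizes depend on how the values are spread over the range.</p>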

<h4>K = 4 example (four equal-width bins):</h4>

In [None]:
bin_Means(numericalFeatures, 4) # k = 4, 10, 50

<h4>K = 100 example:</h4>

In [None]:
bin_Means(numericalFeatures, 100)

<h2>Boundaries</h2>

In [4]:
def bin_Boundaries(data, depth):
    """Smooth attribute E by bin boundaries, using `depth` equal-width bins."""
    data = data['E'].sort_values()
    
    binValues, binEdges = pd.cut(data.array, bins = depth,
                                 labels = range(1, depth + 1), retbins = True)
    
    print("The respective bin for each value of attribute: \n", set(binValues), "\n")
    print("The computed specified bins: \n", binEdges, "\n")
    
    binnedValues = pd.DataFrame(
                   list(zip(data, binValues)), columns = ['value', 'bin'])
    
    # Replace each value with the closest bin edge. The edges are sorted, so
    # the closest edge overall is always one of the two edges of the value's own bin.
    for index, observation in binnedValues.iterrows():
        value = observation['value'] # label-based access instead of position 0
        minDistance = float('inf')
        leastDistant = 0
        
        for edge in binEdges:
            distance = abs(edge - value)
            if distance < minDistance:
                leastDistant = edge
                minDistance = distance
                
        # set value at dataframe
        binnedValues.at[index, 'value'] = leastDistant
        
    print(binnedValues)
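<p>The row-by-row loop can also be expressed with array operations. The following is a minimal sketch of that idea, assuming equal-width bins from <code>pd.cut</code>; the sample values and the helper name <code>smooth_by_bin_boundaries</code> are hypothetical. As in the recorded output, the lowest edge sits slightly below the minimum value because <code>pd.cut</code> extends the range by 0.1% on that side.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, not taken from hwk01.csv.
values = pd.Series([4.0, 8.0, 9.0, 15.0, 21.0, 21.0,
                    24.0, 25.0, 26.0, 28.0, 29.0, 34.0])

def smooth_by_bin_boundaries(series, k):
    """Snap each value to the nearer edge of its equal-width bin."""
    codes, edges = pd.cut(series, bins=k, labels=False, retbins=True)
    idx = codes.to_numpy()
    lower = edges[idx]        # left edge of each value's bin
    upper = edges[idx + 1]    # right edge of each value's bin
    x = series.to_numpy()
    # A value exactly halfway between the two edges goes to the lower edge.
    snapped = np.where(x - lower <= upper - x, lower, upper)
    return pd.Series(snapped, index=series.index), edges

snapped, edges = smooth_by_bin_boundaries(values, 3)
print("edges:", edges)
print(pd.DataFrame({"value": values, "snapped": snapped}))
```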

<h4>K = 4 Example (four equal-width bins)</h4>

In [5]:
bin_Boundaries(numericalFeatures, 4)

The respective bin for each value of attribute: 
 {1, 2, 3, 4} 

The computed specified bins: 
 [-16.5662972    1.30530452  19.10570464  36.90610475  54.70650487] 

         value  bin
0   -16.566297    1
1   -16.566297    1
2   -16.566297    1
3   -16.566297    1
4   -16.566297    1
..         ...  ...
995  36.906105    4
996  36.906105    4
997  36.906105    4
998  36.906105    4
999  54.706505    4

[1000 rows x 2 columns]


<h4>K = 100</h4>

In [6]:
bin_Boundaries(numericalFeatures, 100)

The respective bin for each value of attribute: 
 {1, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 77, 78, 79, 80, 81, 84, 85, 86, 100} 

The computed specified bins: 
 [-16.5662972  -15.78307959 -15.07106359 -14.35904758 -13.64703158
 -12.93501558 -12.22299957 -11.51098357 -10.79896756 -10.08695156
  -9.37493555  -8.66291955  -7.95090354  -7.23888754  -6.52687153
  -5.81485553  -5.10283952  -4.39082352  -3.67880751  -2.96679151
  -2.25477551  -1.5427595   -0.8307435   -0.11872749   0.59328851
   1.30530452   2.01732052   2.72933653   3.44135253   4.15336854
   4.86538454   5.57740055   6.28941655   7.00143256   7.71344856
   8.42546457   9.13748057   9.84949657  10.56151258  11.27352858
  11.98554459  12.69756059  13.4095766   14.1215926   14.83360861
  15.5456

<h2>Median</h2>

In [7]:
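def bin_Medians(data, depth):
    """Smooth attribute F by bin medians, using `depth` equal-width bins."""
    data = data['F'].sort_values() # Series of F values in ascending order
    
    # pd.cut splits the value range into `depth` equal-width intervals,
    # labelled 1..depth; retbins = True also returns the bin edges.
    binValues, binEdges = pd.cut(data.array, bins = depth,
                labels = range(1, depth + 1), retbins = True)
        
    print("The respective bin for each value of attribute: \n", set(binValues), "\n")
    print("The computed specified bins: \n", binEdges, "\n")
    
    # Pair each value with its bin label, then take the median within each bin.
    binnedValues = pd.DataFrame(
                   list(zip(data, binValues)), columns = ['value', 'bin'])
    binnedValuesMedian = binnedValues.groupby(['bin']).median()
    print("Median value for each value in bin: \n", binnedValuesMedian, "\n\n")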

<h4>K = 4 Example (four equal-width bins)</h4>

In [8]:
bin_Medians(numericalFeatures, 4)

The respective bin for each value of attribute: 
 {1, 2, 3, 4} 

The computed specified bins: 
 [ 0.99  3.5   6.    8.5  11.  ] 

Median value for each value in bin: 
      value
bin       
1        2
2        5
3        7
4       10 




<h4>K = 100</h4>

In [9]:
bin_Medians(numericalFeatures, 100)

The respective bin for each value of attribute: 
 {1, 100, 70, 40, 10, 80, 50, 20, 90, 60, 30} 

The computed specified bins: 
 [ 0.99  1.1   1.2   1.3   1.4   1.5   1.6   1.7   1.8   1.9   2.    2.1
  2.2   2.3   2.4   2.5   2.6   2.7   2.8   2.9   3.    3.1   3.2   3.3
  3.4   3.5   3.6   3.7   3.8   3.9   4.    4.1   4.2   4.3   4.4   4.5
  4.6   4.7   4.8   4.9   5.    5.1   5.2   5.3   5.4   5.5   5.6   5.7
  5.8   5.9   6.    6.1   6.2   6.3   6.4   6.5   6.6   6.7   6.8   6.9
  7.    7.1   7.2   7.3   7.4   7.5   7.6   7.7   7.8   7.9   8.    8.1
  8.2   8.3   8.4   8.5   8.6   8.7   8.8   8.9   9.    9.1   9.2   9.3
  9.4   9.5   9.6   9.7   9.8   9.9  10.   10.1  10.2  10.3  10.4  10.5
 10.6  10.7  10.8  10.9  11.  ] 

Median value for each value in bin: 
      value
bin       
1        1
10       2
20       3
30       4
40       5
50       6
60       7
70       8
80       9
90      10
100     11 
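
<p>Only 11 of the 100 bins are occupied in the output above. Judging from the edges and medians, <code>F</code> appears to take only integer values from 1 to 11, so bins of width 0.1 mostly stay empty. The sketch below reproduces that effect on a hypothetical stand-in for <code>F</code>; the data is an assumption, not the real column.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for column F: the integers 1..11, repeated.
values = pd.Series(np.tile(np.arange(1, 12), 5))

codes, edges = pd.cut(values, bins=100, labels=False, retbins=True)
occupied = sorted(set(codes + 1))  # 1-based bin numbers

print("number of bins:", len(edges) - 1)
print("bin width:", edges[2] - edges[1])
print("occupied bins:", len(occupied))
```

<p>With so few distinct values, increasing K beyond the number of distinct values cannot change the smoothed result: each occupied bin holds a single value, which is its own median.</p>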




In [None]:
pcaAnalysis(numericalFeatures, ['C', 'D', 'E', 'F'], 2)
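
<p>The definition of <code>pcaAnalysis</code> is not shown in this section. As a rough idea of what a call with a DataFrame, a column list, and a component count might do, here is a minimal sketch built on scikit-learn's <code>PCA</code>. The name <code>pca_sketch</code>, the standardization step, and the random demo data are all assumptions, not the notebook's implementation.</p>

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def pca_sketch(frame, columns, n_components):
    """Hypothetical stand-in for pcaAnalysis: project columns onto principal components."""
    X = frame[columns].to_numpy(dtype=float)
    # Assumption: standardize so that no column dominates because of its scale.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(X)
    names = [f"PC{i + 1}" for i in range(n_components)]
    result = pd.DataFrame(scores, columns=names, index=frame.index)
    return result, pca.explained_variance_ratio_

# Random demo data (not hwk01.csv); D is made to depend strongly on C.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["C", "D", "E", "F"])
demo["D"] = 2 * demo["C"] + rng.normal(scale=0.1, size=200)

scores, ratio = pca_sketch(demo, ["C", "D", "E", "F"], 2)
print(scores.head())
print("explained variance ratio:", ratio)
```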

In [None]:
calculate_correlation(numericalFeatures)
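
<p>The definition of <code>calculate_correlation</code> is likewise not shown. One plausible reading is a pairwise Pearson correlation matrix over the numerical columns, which pandas provides through <code>DataFrame.corr</code>. The helper name <code>correlation_sketch</code> and the demo data below are assumptions.</p>

```python
import pandas as pd

def correlation_sketch(frame):
    """Hypothetical stand-in for calculate_correlation: pairwise Pearson coefficients."""
    return frame.corr(method="pearson")

# Small demo frame (not hwk01.csv) with known relationships.
demo = pd.DataFrame({"C": [1.0, 2.0, 3.0, 4.0, 5.0]})
demo["D"] = 2 * demo["C"] + 1   # perfect positive linear relationship with C
demo["E"] = -demo["C"]          # perfect negative linear relationship with C

corr = correlation_sketch(demo)
print(corr)
```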

In [None]:
construct_contingency_table(nominalFeatures)
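
<p>The definition of <code>construct_contingency_table</code> is not shown either. For two categorical columns such as <code>A</code> and <code>B</code>, a contingency table can be built with <code>pd.crosstab</code>. The sketch below assumes the first two columns of the frame are the ones to cross-tabulate; the helper name <code>contingency_sketch</code> and the demo categories are hypothetical.</p>

```python
import pandas as pd

def contingency_sketch(frame):
    """Hypothetical stand-in for construct_contingency_table.

    Cross-tabulates the first two columns and adds row/column totals.
    """
    rows = frame.iloc[:, 0]
    cols = frame.iloc[:, 1]
    return pd.crosstab(rows, cols, margins=True)

# Small demo frame (not hwk01.csv).
demo = pd.DataFrame({"A": ["x", "x", "y", "y", "y"],
                     "B": ["p", "q", "p", "p", "q"]})

table = contingency_sketch(demo)
print(table)
```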