# Data Features 

Given a dataset of  𝑀  variables and  𝑁  data points, a feature is one of the **independent** variables in the dataset and is also called a **predictor**. Generally the input to a machine learning program is a tabular dataset where each column is a feature and each row (of  𝑁  rows) is a data point in  𝑀-dimensional space.

In our Jupyter notebooks we will use the matrix $X_{N \times M}$ as the dataset symbol without the dependent variable (also called the label, category, class, or predicted variable), and we will store the dependent variable in the vector $y_{N \times 1}$.

$X$ is a matrix. Its rows are data points and its columns are the features. All classifiers in scikit-learn understand this data format, which is based on NumPy 2-dimensional arrays. Thus for each input data point we have $x \in \mathbb{R}^M$, and we have $N$ data points to be used in our ML pipelines. We also have $y \in \mathbb{Z}$ in general as the category. As an example, if we have $K$ categories, then $y \in \{k : 0 \le k < K,\ k \in \mathbb{Z}\}$.

**Example:** In the following  𝑋  matrix we have 3 features and 5 data points, i.e.  𝑀=3  and  𝑁=5 . We also have  5  values for the dependent variable  𝑦 .

© Guven

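$$
X = \begin{bmatrix}
x_{11} & x_{12} & x_{13} \\
x_{21} & x_{22} & x_{23} \\
x_{31} & x_{32} & x_{33} \\
x_{41} & x_{42} & x_{43} \\
x_{51} & x_{52} & x_{53}
\end{bmatrix}, \quad
y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \end{bmatrix}
$$

Such that in practice,

$$
X = \begin{bmatrix}
4.9 & 3.0 & 1.4 \\
4.7 & 3.2 & 1.3 \\
4.6 & 3.1 & 1.5 \\
5.0 & 3.6 & 1.4 \\
5.4 & 3.9 & 1.7
\end{bmatrix}, \quad
y = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 1 \\ 2 \end{bmatrix}
$$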

where the dependent variable  𝑦  has levels from the alphabet  Σ={0,1,2} , i.e. there are 3 categories in the given dataset.
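The numeric example above can be written directly as NumPy arrays. This is only a minimal sketch of the $X$ and $y$ layout described in this section; the variable names `N` and `M` are chosen here to match the notation.

```python
import numpy as np

# Feature matrix X: N=5 rows (data points), M=3 columns (features)
X = np.array([
    [4.9, 3.0, 1.4],
    [4.7, 3.2, 1.3],
    [4.6, 3.1, 1.5],
    [5.0, 3.6, 1.4],
    [5.4, 3.9, 1.7],
])

# Label vector y: one integer category per data point
y = np.array([1, 0, 0, 1, 2])

N, M = X.shape
print(f'N={N}, M={M}')
print('categories present in y:', np.unique(y))
```

Each row `X[i]` is one data point in 3-dimensional space and `y[i]` is its category.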

The predicted variable or label is the **dependent** variable, and it depends on the **independent** variables or features. The strength of this dependence can be high or very low depending on the nature of the problem, as expressed by the dataset. If a feature shows no correlation with the label, or is fully independent of it, then the dataset may not be suitable for the problem at hand and/or we may have to remove that feature from the dataset.

As an example, for numerical variables, the Pearson correlation coefficient of two variables  𝑥  (lower case x, a feature) and  𝑦  is defined as:

$$
r_{xy} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}
$$

where $\bar{x}$ and $\bar{y}$ are sample means.

Ideally, we need a good correlation (close to 1 or -1) between the independent and dependent (predicted) variables so our ML model would actually work.
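As a rough illustration, the Pearson coefficient can be computed term by term with NumPy. The helper name `pearson_r` is made up for this sketch. The inputs reuse the first feature column and the labels from the small example earlier; treating the category codes as numbers is done here only for illustration, and five points are far too few for a reliable estimate.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the sample mean of x
    dy = y - y.mean()  # deviations from the sample mean of y
    return np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))

x = np.array([4.9, 4.7, 4.6, 5.0, 5.4])  # first feature column of the example
y = np.array([1, 0, 0, 1, 2])            # labels of the example

r = pearson_r(x, y)
print(f'r_xy = {r:.3f}')

# Cross-check against NumPy's built-in correlation matrix
print(np.isclose(r, np.corrcoef(x, y)[0, 1]))
```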

**Question:** Can the correlation coefficient be used to determine which features are important for the machine learning model?

## Data types

* Numerical - Can be integer  ℤ  or floating point  ℝ . Generally it is safe to convert all numerical variables to floating point variables.
* Nominal - The variable values are drawn from a finite set of levels or from an alphabet  Σ .
* Binary - The variable values can be either 0 or 1 (or, False or True). Some algorithms run fast on this kind of value, especially methods related to constrained optimization.
* String - May not be used directly unless the ML program (preprocessing) knows how to deal with it.
* Date - May not be used directly unless the ML program (preprocessing) knows how to deal with it. It might be a good idea to convert (or map) dates to some integer numbers - for example, Excel handles dates in this manner.
* More complicated features, e.g. a DNA sequence (sequence of {A,C,G,T} letters) - Other, simpler, features need to be extracted from the input sequence so that this higher-level feature can be used in an ML program.
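The date-to-integer mapping mentioned in the list above could look like the following sketch. The helper `date_to_int` and the choice of 1970-01-01 as the origin are assumptions made for this example; any fixed reference date works as long as it is used consistently.

```python
from datetime import date

def date_to_int(d, origin=date(1970, 1, 1)):
    """Map a date to the number of days elapsed since `origin`."""
    return (d - origin).days

# Hypothetical date values, for illustration only
dates = [date(2018, 1, 1), date(2018, 1, 31), date(2019, 1, 1)]
encoded = [date_to_int(d) for d in dates]
print(encoded)
```

The resulting integers preserve ordering and distances in days, so they can be used like any other numerical feature.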

## Nominal to Numerical Conversion

One possible way of converting nominal variables to numerical is one-hot encoding:

1. During preprocessing, count the number of distinct levels that the nominal variable $v_{nom}$ takes, say $L$ different levels indexed by $k = 1, \dots, L$.
2. Create $L$ binary variables for the nominal variable $v_{nom}$. In each row, the $j$-th binary variable is 1 when $v_{nom}$ takes level $j$, and the other $L-1$ binary variables are 0.

Following the above procedure, a nominal variable with a cardinality of $L$ results in the creation of $L$ binary variables (and the original nominal variable itself is dropped). In other words, each unique level of that nominal variable is mapped to a binary variable. Note that this representation costs extra storage space.

Also observe that the one-hot encoded variables are like unit vectors of linear algebra.

Conversion of nominal variables to numerical is an important step for many numerical-only classifiers, such as neural networks, support vector machines, and linear regression.

### Example One-hot Encoding

The nominal variable $T$ takes levels from $\Sigma = \{\text{low}, \text{medium}, \text{high}\}$. Numerical conversion maps each unique level to one of the binary variables $T_i$.

| Nominal Variable $T$ | $T_0$ | $T_1$ | $T_2$ |
|---|---|---|---|
| low | 1 | 0 | 0 |
| medium | 0 | 1 | 0 |
| low | 1 | 0 | 0 |
| high | 0 | 0 | 1 |
| high | 0 | 0 | 1 |
| low | 1 | 0 | 0 |
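The table above can be reproduced with a few lines of NumPy. This is a minimal sketch of the two-step procedure, not the only way to do it; in practice `pandas.get_dummies` performs the same conversion on a DataFrame column. The variable names are chosen for this example.

```python
import numpy as np

levels = ['low', 'medium', 'high']                     # the alphabet, L = 3
T = ['low', 'medium', 'low', 'high', 'high', 'low']    # values of the nominal variable

# Step 1: map each level to its integer index 0..L-1
codes = np.array([levels.index(v) for v in T])

# Step 2: row k of the identity matrix is the unit vector for level k
one_hot = np.eye(len(levels), dtype=int)[codes]
print(one_hot)
```

Indexing the identity matrix makes the "unit vector" observation explicit: every encoded row contains exactly one 1.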

## Numerical to Nominal Conversion
Generally, histograms, binning and bin boundaries are used to group numerical values into levels or one-hot encoded variables.
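A small sketch of binning with `numpy.digitize` follows. The bin boundaries 4.8 and 5.2 and the level names are arbitrary assumptions picked for this example; in practice the boundaries would come from a histogram or from domain knowledge.

```python
import numpy as np

values = np.array([4.9, 4.7, 4.6, 5.0, 5.4])  # a numerical feature
boundaries = [4.8, 5.2]                       # hypothetical bin boundaries
levels = ['low', 'medium', 'high']            # one level per bin

# digitize returns 0 for v < 4.8, 1 for 4.8 <= v < 5.2, 2 for v >= 5.2
codes = np.digitize(values, boundaries)
nominal = [levels[c] for c in codes]
print(nominal)
```

The integer `codes` can then be one-hot encoded in the same way as any other nominal variable.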

## Online Dataset Sources

The UCI KDD online repository has various datasets that can be used for analysis and machine learning in several application fields, such as GIS, cybersecurity, NLP, etc. Some datasets go back more than 20 years and were sourced from competitions, challenges, grants, etc. Researchers and students use these datasets and share their experiences on a common platform.

Source: UCI Knowledge Discovery in Databases Archive http://kdd.ics.uci.edu/

**Kaggle** data repository has various datasets which are used for Kaggle competitions. The web site also has tools to examine the features on-site. This source is one of the largest.
Go to the Kaggle dataset source: https://www.kaggle.com/datasets

**KDnuggets** is another web page which encompasses almost everything (posts, news, datasets, tutorials, forums, webinars, software, etc.) that is relevant to machine learning and data analysis.
Source: KDnuggets Datasets for Data Mining, https://www.kdnuggets.com/datasets/index.html

The rest of the notebook will demonstrate three different datasets from these repositories.

* UCI KDD archive  →  1990 US Census data
* Kaggle  →  Graduate Admissions data
* Kaggle  →  The Human Freedom Index data

Download the data files from the UCI KDD and Kaggle web sites (registering with Kaggle, using a disposable email address if necessary).

**Important Note:** About the physical dataset file shared among teams. Comparing machine learning models and measuring performance for model selection depend heavily on the input dataset. Thus, when models, experiments, or teams' results are compared, every team and every set of experiments must use exactly the same dataset. Moreover, to ensure validity, the exact same file should be shared among the teams and between the different model pipelines.
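One possible way to check that two teams really hold the exact same file is to compare checksums. The sketch below is not part of the original workflow; the helper `file_sha256` and the temporary file names are made up for the demonstration.

```python
import hashlib
import os
import tempfile

def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# Demo with two temporary copies holding identical content
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, 'team_a.csv')
    b = os.path.join(tmp, 'team_b.csv')
    for p in (a, b):
        with open(p, 'w', newline='') as f:
            f.write('x1,x2,y\n4.9,3.0,1\n')
    same = file_sha256(a) == file_sha256(b)

print('identical files:', same)
```

If the digests differ, the files differ in at least one byte, and results obtained from them should not be compared directly.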

In the following cells we use the downloaded and previously cleaned data files:

* USCensus1990.data.csv
* Admission_Predict.csv
* hfi_cc_2018_cleaned.csv

Note that there are two dataset cleaning tasks before a machine learning model development can begin:

* Cleaning the data so that the framework reads it correctly, i.e. formatting, removing confusing symbols, quotes, etc.
* Cleaning (preprocessing) the data to improve the ML task, i.e. imputing values, removing outliers, removing incorrect dataset values, deriving variables, selecting variables, etc.

Both cleaning steps are crucial in preparing the dataset for model development.
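The two kinds of cleaning can be sketched with pandas on a tiny made-up table. The column names, the values, the median imputation and the 1.5 × IQR outlier rule are all assumptions chosen for illustration, not steps prescribed by these notes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: stray quotes, one missing value, one extreme value
df = pd.DataFrame({
    'region': ["'east'", 'west', "'north'", 'south', 'east', 'west'],
    'score': [6.1, 8.5, np.nan, 8.0, 6.4, 99.0],
})

# Task 1: formatting clean-up, remove the quote characters
df['region'] = df['region'].str.replace("'", '', regex=False)

# Task 2a: impute the missing value with the column median
df['score'] = df['score'].fillna(df['score'].median())

# Task 2b: drop rows outside 1.5 * IQR of the quartiles
q1, q3 = df['score'].quantile([0.25, 0.75])
iqr = q3 - q1
inside = df['score'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
clean = df[inside].reset_index(drop=True)

print(clean)
```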

**Quote:** "As data scientists, our job is to extract signal from noise." (ref: KDnuggets)

Let's see what our data files contain:

In [None]:
import pandas as pd

# Locate and load the data file
df = pd.read_csv('../datasets/USCensus1990.data.csv')

# Sanity check
print(f'N rows={len(df)}, M columns={len(df.columns)}')
df.head()

In [None]:
# Locate and load the data file
df = pd.read_csv('../datasets/Admission_Predict_Ver1.1.csv')

# Sanity check
print(f'N rows={len(df)}, M columns={len(df.columns)}')
df.head()

In [None]:
# Locate and load the data file
df = pd.read_csv('../datasets/hfi_cc_2018.csv')

# Sanity check
print(f'N rows={len(df)}, M columns={len(df.columns)}')
df.head()

## The Human Freedom Index Dataset
Opening the hfi_cc_2018 CSV data file in Weka is tricky, as it needs the two modifications below:

* Using a text editor, change the value "d'Ivoire" to "dIvoire" by removing the single quote. The single quote is used by Weka to mark nominal variables.
* Using a text editor, add single quotes around the feature name region to mark it as nominal. Weka wants to see single quotes around the variable name (in the header) to be able to load the types of the variables correctly.


This particular example shows that data mining and machine learning frameworks such as Weka have their own conventions that the model developer has to pay attention to.

## Weka Framework

Weka is a data analytics framework (open-source, Java based) with very strong ML and data mining abilities. To install:

1. Download and install 64-bit Java JRE https://www.java.com/en/download/
2. Download the Weka Linux distro zip file from https://www.cs.waikato.ac.nz/ml/weka/downloading.html and extract the zip to C:\weka on your computer's local disk.

Then use the Windows command prompt:

1. Check the version of Java with java -version to make sure that the java on the path is the one you downloaded.
2. Run java -Xmx8g -jar c:\weka\weka.jar (8 GB of heap space, to be used for big data files; make sure your computer supports this, or adjust the value).

WATCH THE RELATED LECTURE for opening and using Weka, preprocessing, and running the Random Forest classifier.

## Graduate Admissions Dataset

Let's open the data file Admission_Predict.csv in Weka. Click on Explorer and open the CSV file with the Open File button. We need a dependent variable to predict. Let's pick A9, Chance of Admit (A_XX denotes the attribute at index XX, counting from 1). We need to convert this variable to a categorical variable.

1. In Filter, AddExpression -E "ifelse (A9 > 0.9, 1, 0)" -N Admit then press Apply
2. In Filter, NumericToNominal -R last
3. In Filter, RenameNominalValues -R last -N "0:No, 1:Yes"
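For readers who prefer pandas, the three filter steps above have a rough equivalent. This is only a sketch on made-up values; the column name Chance of Admit is taken from the text, and the real CSV header may differ slightly (for example in whitespace).

```python
import pandas as pd

# Hypothetical values standing in for the real column
df = pd.DataFrame({'Chance of Admit': [0.92, 0.76, 0.95, 0.65, 0.90]})

# Step 1 (AddExpression): 1 if the chance is above 0.9, else 0
df['Admit'] = (df['Chance of Admit'] > 0.9).astype(int)

# Steps 2 and 3 (NumericToNominal, RenameNominalValues): categorical No/Yes
df['Admit'] = df['Admit'].map({0: 'No', 1: 'Yes'}).astype('category')

print(df)
```

Note that the comparison is strict, so a value of exactly 0.9 maps to No, as in the Weka expression.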

After preprocessing, pick the RandomForest (RF) classifier from Classifier-Trees-RandomForest. Run it with 10-fold cross validation using the Start button. Observe the outcome.

**Question:** Why do you think the RF model reaches 100% performance?

Now remove the variable "Serial No." (why is it useless?) and remove the "Chance of Admit" variable (Remove button down below). Remember that we categorized it into the variable named "Admit".

**Question:** Why do you think RF model performance is less than 100% now?

---

Student: Below cell can be safely ignored. This is the code to display markdown tables left oriented in this notebook.

In [None]:
%%html
<style>
    table {margin-left: 0 !important;}
</style>