## In-Class 38-1 - Data Preprocessing In C++

## Create folder

<b>To review:</b>

<b>1.</b> Open Docker Desktop. Go to the cse20133-user Container you created. Press the Blue Triangle to start the Container.

<b>2.</b> Open VSCode and select the blue button in the bottom-left corner of the window. From the pull-down menu that appears at the top, choose “Attach to running container” and select your CSE 20133 course container.

<b>3.</b> Go into your Git Folder:

> Recall that @USERNAME is the unique username you created when you created your GitHub account. You will see your user name in the VS Code Docker

    cd cse20133-user/cse20133-@USERNAME

Create the folder:

    mkdir lec38
    cd lec38
    
    
### Obtaining the class files

Run the following commands:

    wget https://raw.githubusercontent.com/mmorri22/cse20133/main/inclass/lec38/setup.sh
    chmod a+rx setup.sh
    ./setup.sh
    
Once you run these commands, your folder will contain the following files:

    Makefile electric.cpp electric_sol.cpp sin_cos.cpp sin_cos_sol.cpp

### What is Data Preprocessing?

<b>Data preprocessing</b> is the process of <b>converting raw data</b> into <b>computer-understandable formats</b>. It is the first step in most machine learning pipelines.

Data preprocessing includes:
<ul>
    <li>Reading Data from files.</li>
    <li>Data cleaning.</li>
    <li>Instance selection.</li>
    <li>Data standardization.</li>
    <li>Data transformation.</li>
    <li>Feature extraction and selection.</li>
</ul>

The product of data preprocessing is the final <b>training set</b>. A training set is the set of examples used to fit the parameters of the ML model.
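
To make one of the steps listed above concrete, here is a minimal sketch of data standardization (z-score scaling) for a single feature column. The function name `standardize` is hypothetical and is not part of the course files. The sketch assumes the population standard deviation and maps a constant column to zeros, which is one reasonable choice among several.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper (not part of the course files): rescales one feature
// column to zero mean and unit variance (z-score standardization).
std::vector<float> standardize( const std::vector<float>& values ){

    std::vector<float> result;

    // An empty column has nothing to rescale
    if( values.empty() ){
        return result;
    }

    // Compute the mean of the column
    double sum = 0.0;
    for( const float& value : values ){
        sum += value;
    }
    const double mean = sum / static_cast<double>( values.size() );

    // Compute the population standard deviation (divide by N, an assumption)
    double sq_diff = 0.0;
    for( const float& value : values ){
        sq_diff += ( value - mean ) * ( value - mean );
    }
    const double std_dev = std::sqrt( sq_diff / static_cast<double>( values.size() ) );

    // Rescale each value; a constant column maps to all zeros
    for( const float& value : values ){
        result.push_back( std_dev > 0.0
            ? static_cast<float>( ( value - mean ) / std_dev )
            : 0.0f );
    }

    return result;
}
```

Calling `standardize` on a column such as the sepal lengths would return a vector of the same size whose values are centered on zero, which keeps features measured on different scales comparable.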

### Initial Example: The Iris Data set

In this example, we will perform data preprocessing on the Iris data set. The <a href = "https://en.wikipedia.org/wiki/Iris_flower_data_set">Iris data set</a> is one of the best-known data sets in statistics and data science, and it is widely used for evaluating classification methods.

The data set consists of 50 samples from each of three species of Iris:
<ol>
    <li>Iris setosa</li>
    <li>Iris virginica</li>
    <li>Iris versicolor</li>
</ol>

Four features were measured from each sample in centimeters:
<ol>
    <li>the length of the sepals</li>
    <li>the width of the sepals</li>
    <li>the length of the petals</li>
    <li>the width of the petals</li>
</ol>

Based on the combination of these four features, <a href = "https://en.wikipedia.org/wiki/Ronald_Fisher">Ronald Fisher</a> developed a linear discriminant model to distinguish the species from each other. 

### Reading the Iris Dataset from Files

You can download the iris dataset into your container using the following commands in VS Code:

    mkdir reading38
    cd reading38
    wget https://raw.githubusercontent.com/mmorri22/cse20133/main/readings/lec38/iris.data
    
Looking at the first few lines, we can see how the data corresponds to our idea of how the dataset is organized:

    5.1 3.5 1.4 0.2 Iris-setosa
    4.9 3.0 1.4 0.2 Iris-setosa
    4.7 3.2 1.3 0.2 Iris-setosa
    4.6 3.1 1.5 0.2 Iris-setosa

### Iris Class

In the <a href = "https://github.com/mmorri22/cse20133/blob/main/readings/lec38/preprocessing.h">preprocessing.h</a> file, we have an Iris class that contains the values from the dataset, a constructor, a destructor, accessors for the four measurements, and a print function:

    class Iris {

        private: 
            float sepal_length; 
            float sepal_width;
            float petal_length;
            float petal_width;
            std::string ir_class;

        public:

            // Constructor
            Iris( const float& sepal_length, const float& sepal_width, 
                const float& petal_length, const float& petal_width, const std::string& ir_class ) :
                sepal_length(sepal_length), sepal_width(sepal_width), petal_length(petal_length),
                petal_width(petal_width), ir_class(ir_class) {}

            // Destructor
            ~Iris(){}
            
            // Accessors
            float get_sepal_length() const{
                return this->sepal_length;
            }

            float get_sepal_width() const{
                return this->sepal_width;
            }

            float get_petal_length() const{
                return this->petal_length;
            }

            float get_petal_width() const{
                return this->petal_width;
            }

            // Print outcome
            void print_iris_data() const{

                std::cout << "------------------------------------------------------" << std::endl;
                std::cout << "Type: " << this->ir_class << std::endl;
                std::cout << "Sepal Length and Width: " << this->sepal_length << ", " << this->sepal_width << std::endl;
                std::cout << "Petal Length and Width: " << this->petal_length << ", " << this->petal_width << std::endl;

            }
    }; 

### Reading in from the File

In this <a href = "https://github.com/mmorri22/cse20133/blob/main/readings/lec38/preprocessing.cpp">preprocessing.cpp file</a>, we read the data from a file using an input file stream and store each record as an Iris object in a single std::vector.

    std::vector< Iris > read_iris_dataset( const std::string& file_name ){

        /* Create the input stream from iris.data, which is stored in file_name */
        std::ifstream myfile( file_name );

        /* We will eventually store all of these in a vector of Iris objects called iris_dataset */
        std::vector< Iris > iris_dataset;

        /* Now we will create intermediate values to read in from the file */
        float sepal_len, sepal_wid, petal_len, petal_wid;
        std::string type_string;

        /* Check that the file exists! */
        if ( myfile.is_open() ){

            std::cout<< "Iris Dataset opened successfully" <<std::endl;

            /* Read the next record: four measurements followed by the class label */
            while (myfile >> sepal_len >> sepal_wid >> petal_len >> petal_wid >> type_string) {

                /* Construct an Iris using the constructor */
                Iris temp_iris( sepal_len, sepal_wid, petal_len, petal_wid, type_string );

                /* Push it onto the back of the vector */
                iris_dataset.push_back( temp_iris );

            }

        }

        else{
            /* Print to the user that the file was not opened */
            std::cout << "Unable to open file" << std::endl;
        }

        /* Return the std::vector representing the data set */
        return iris_dataset;
    }


### Printing the Dataset

In this function, we iterate through the dataset and print the results.

    void print_iris_dataset(const std::vector< Iris >& iris_dataset){

        // Iterate through the entire data set
        for( long unsigned int iter = 0; iter < iris_dataset.size(); ++iter ){

            // Print that Iris
            iris_dataset.at(iter).print_iris_data();
        }

    }

### Code Set

The full code set can be downloaded with the following commands:

    wget https://raw.githubusercontent.com/mmorri22/cse20133/main/readings/lec38/iris.data
    wget https://raw.githubusercontent.com/mmorri22/cse20133/main/readings/lec38/iris_prepr.cpp
    wget https://raw.githubusercontent.com/mmorri22/cse20133/main/readings/lec38/preprocessing.cpp
    wget https://raw.githubusercontent.com/mmorri22/cse20133/main/readings/lec38/preprocessing.h
    wget https://raw.githubusercontent.com/mmorri22/cse20133/main/readings/lec38/Makefile
    
And you can compile and run using the command:

    make iris_prepr

### <font color = "red">Class Introduction Question #4 - What is data preprocessing, and how do we use it in machine learning?</font>

### The next reading for this lecture is <a href = "https://github.com/mmorri22/cse20133/blob/main/readings/lec38/Reading%2038-2.ipynb">Reading 38-2 - Data Visualization In C++.</a>