## Reading 32-3 - Data Visualization In C++

In this reading, we will observe how to install <a href = "https://github.com/Cryoris/matplotlib-cpp">Matplotlib-CPP</a> in your Ubuntu environment in order to better visualize data.

### What is Data Preprocessing?

<b>Data preprocessing</b> is the process of <b>converting raw data</b> into <b>computer-understandable formats</b>. It is typically the first step in a machine learning workflow.

Data preprocessing includes:
<ul>
    <li>Reading data from files.</li>
    <li>Data cleaning.</li>
    <li>Instance selection.</li>
    <li>Data standardization.</li>
    <li>Data transformation.</li>
    <li>Feature extraction and selection.</li>
</ul>
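
As a concrete illustration of one item in the list above, data standardization, here is a minimal sketch of z-score standardization for a single feature column. The function name `standardize` is hypothetical and not part of this reading's code, and using the population standard deviation (dividing by N) is an assumption; other conventions exist.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper (not part of the reading): rescales one feature
// column to zero mean and unit variance (z-score standardization).
std::vector<float> standardize(const std::vector<float>& column)
{
    std::vector<float> result;
    if (column.empty()) {
        return result;
    }

    // Mean of the column.
    float sum = 0.0f;
    for (float value : column) {
        sum += value;
    }
    const float mean = sum / column.size();

    // Population standard deviation (divides by N, not N - 1).
    float squared_diff_sum = 0.0f;
    for (float value : column) {
        squared_diff_sum += (value - mean) * (value - mean);
    }
    const float std_dev = std::sqrt(squared_diff_sum / column.size());

    // A constant column has zero spread; map it to all zeros
    // instead of dividing by zero.
    result.reserve(column.size());
    for (float value : column) {
        result.push_back(std_dev > 0.0f ? (value - mean) / std_dev : 0.0f);
    }
    return result;
}
```

Standardizing each feature this way puts measurements with different ranges on a comparable scale before a model is fit.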

The product of data preprocessing is the final <b>training set</b>. A training set is the set of examples used to fit the parameters of the ML model.

### Initial Example: The Iris Data set

In this example, we will perform data preprocessing on the Iris data set. The <a href = "https://en.wikipedia.org/wiki/Iris_flower_data_set">Iris data set</a> is one of the best-known and most widely used data sets in statistics and data science, and it is commonly used to evaluate classification methods.

The data set consists of 50 samples from each of three species of Iris:
<ol>
    <li>Iris setosa</li>
    <li>Iris virginica</li>
    <li>Iris versicolor</li>
</ol>

Four features were measured from each sample in centimeters:
<ol>
    <li>the length of the sepals</li>
    <li>the width of the sepals</li>
    <li>the length of the petals</li>
    <li>the width of the petals</li>
</ol>

Based on the combination of these four features, <a href = "https://en.wikipedia.org/wiki/Ronald_Fisher">Ronald Fisher</a> developed a linear discriminant model to distinguish the species from each other. 

### Reading the Iris Dataset from Files

You can download the Iris data set into your container by running the following commands in the VS Code terminal:

    mkdir reading32
    cd reading32
    wget https://raw.githubusercontent.com/mmorri22/cse20133/main/readings/lec32/iris.data
    
Looking at the first few lines, we can see that each row matches the organization described above: four comma-separated measurements followed by the species name.

    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
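
To make the row layout concrete, here is a minimal sketch that parses a single line of `iris.data` into a record. The `IrisSample` struct and the `parse_iris_line` function are hypothetical names introduced only for illustration. The sketch splits on commas with `std::getline`, which is a different approach from the reading's own function that follows.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical record type for one row of iris.data.
struct IrisSample {
    float sepal_length;
    float sepal_width;
    float petal_length;
    float petal_width;
    std::string species;
};

// Parses one comma-separated line such as "5.1,3.5,1.4,0.2,Iris-setosa".
// Returns false (leaving `sample` untouched) if the line does not hold
// four numbers followed by a non-empty species label.
bool parse_iris_line(const std::string& line, IrisSample& sample)
{
    std::istringstream iss(line);
    std::string field;
    float values[4];

    // The first four fields are the measurements in centimeters.
    for (int i = 0; i < 4; ++i) {
        if (!std::getline(iss, field, ',')) {
            return false;
        }
        try {
            values[i] = std::stof(field);
        } catch (const std::logic_error&) {
            // std::stof throws for non-numeric or out-of-range text.
            return false;
        }
    }

    // The rest of the line is the species name.
    std::string species;
    if (!std::getline(iss, species) || species.empty()) {
        return false;
    }

    sample.sepal_length = values[0];
    sample.sepal_width = values[1];
    sample.petal_length = values[2];
    sample.petal_width = values[3];
    sample.species = species;
    return true;
}
```

Returning `false` for blank or malformed lines lets a caller skip them instead of storing invalid rows.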



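The following function reads `iris.data` line by line and returns the data set as five columns: sepal length, sepal width, petal length, petal width, and a numeric class code. The numeric values in the `Iris_Class` enum are an assumption; any distinct codes work as long as they are used consistently.

    #include <algorithm>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Numeric codes stored in the class column (values are an assumption).
    enum Iris_Class { Iris_unknown = 0, Iris_setosa, Iris_versicolor, Iris_virginica };

    // Returns five columns: sepal length, sepal width, petal length,
    // petal width, and class code. Returns an empty vector if the file
    // cannot be opened.
    std::vector<std::vector<float>> Read_Iris_Dataset(void)
    {
        std::ifstream myfile("iris.data");
        std::string line;
        std::vector<std::vector<float>> Iris_Dataset;
        std::vector<float> temp_sepal_len;
        std::vector<float> temp_sepal_wid;
        std::vector<float> temp_petal_len;
        std::vector<float> temp_petal_wid;
        std::vector<float> temp_iris_class;

        float sepal_len_f, sepal_wid_f, petal_len_f, petal_wid_f;
        float iris_class_f;
        std::string temp_string;

        if (myfile.is_open())
        {
            std::cout << "Iris Dataset opened successfully" << std::endl;

            while (std::getline(myfile, line)) {
                // Turn "5.1,3.5,1.4,0.2,Iris-setosa" into
                // "5.1 3.5 1.4 0.2 Iris_setosa" so operator>> can split it.
                std::replace(line.begin(), line.end(), '-', '_');
                std::replace(line.begin(), line.end(), ',', ' ');

                std::istringstream iss(line);

                // Skip blank or malformed lines instead of storing bad values.
                if (!(iss >> sepal_len_f >> sepal_wid_f >> petal_len_f
                          >> petal_wid_f >> temp_string)) {
                    continue;
                }

                temp_sepal_len.push_back(sepal_len_f);
                temp_sepal_wid.push_back(sepal_wid_f);
                temp_petal_len.push_back(petal_len_f);
                temp_petal_wid.push_back(petal_wid_f);

                // Map the species name to its numeric class code.
                if (temp_string.compare("Iris_setosa") == 0) {
                    iris_class_f = Iris_setosa;
                }
                else if (temp_string.compare("Iris_versicolor") == 0) {
                    iris_class_f = Iris_versicolor;
                }
                else if (temp_string.compare("Iris_virginica") == 0) {
                    iris_class_f = Iris_virginica;
                }
                else {
                    iris_class_f = Iris_unknown;
                }

                temp_iris_class.push_back(iris_class_f);
            }

            // Store the data column by column: Iris_Dataset[column][row].
            Iris_Dataset.push_back(temp_sepal_len);
            Iris_Dataset.push_back(temp_sepal_wid);
            Iris_Dataset.push_back(temp_petal_len);
            Iris_Dataset.push_back(temp_petal_wid);
            Iris_Dataset.push_back(temp_iris_class);
        }
        else
        {
            std::cout << "Unable to open file" << std::endl;
        }

        return Iris_Dataset;
    }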


### <font color = "red">Class Introduction Question #4 - What is data preprocessing, and how do we use data preprocessing in machine learning?</font>

### The next reading for this lecture is <a href = "https://github.com/mmorri22/cse20133/blob/main/readings/lec32/Reading%2032-2.ipynb">Reading 32-2 - Data Preprocessing And Visualization In C++.</a>