## Regression Analysis
## Forecasting Sales Based on Advertising Costs through Parallel Computing with DPC++

The following code is part of our article submission to CodeProject and Intel DevMesh.
It showcases converting single-threaded algorithms to parallel ones.

We implement two statistical algorithms, Pearson's correlation coefficient and linear regression, in DPC++ and show how to apply them in sales and marketing to forecast future sales based on advertising expenditure.

We parallelize the computation of the per-element terms x_squared, y_squared, and xy.
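As a point of reference, here is a minimal serial sketch of those three per-element terms in standard C++ (no DPC++ required). The `Products` struct and the `elementwise_products` function are hypothetical names chosen for illustration and do not appear in the notebook code. Each loop iteration touches only index `i`, which is the property that makes this step a natural fit for a parallel kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the three per-element results.
struct Products {
    std::vector<long long> xy;
    std::vector<long long> x_squared;
    std::vector<long long> y_squared;
};

// Serial stand-in for the per-element work: every iteration is
// independent of the others, so the loop body could run once per
// work-item in a parallel kernel.
Products elementwise_products(const std::vector<int>& x,
                              const std::vector<int>& y) {
    assert(x.size() == y.size());
    Products p;
    p.xy.resize(x.size());
    p.x_squared.resize(x.size());
    p.y_squared.resize(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        p.xy[i] = 1LL * x[i] * y[i];         // x*y
        p.x_squared[i] = 1LL * x[i] * x[i];  // x^2 by plain multiplication
        p.y_squared[i] = 1LL * y[i] * y[i];  // y^2 by plain multiplication
    }
    return p;
}
```

Squaring by multiplication avoids any need for a power function in this step.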

We also give a real-world example of how a sales and marketing department can use regression analysis to predict future sales from the amount it spends on advertising.

To read the article, visit: https://devmesh.intel.com/projects/pearson-s-correlation-coefficient-linear-regression-with-dpc

DPC++ is fast and efficient, and it lets you target CPU, GPU, or FPGA devices, either locally or on Intel's cloud.



### Scenario:
You are hired as the sales and marketing manager for Intel.

You already know that spending money on advertising on Facebook or LinkedIn gets you views that convert to sales.

You have Intel's previous weekly advertising expenditure and the sales it generated.

You now want to forecast the expected sales for next week if you were to increase the marketing budget to $50.


### Answer:
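Using the formula y = a + (b × 50), we forecast that spending \\$50 on advertising can result in \\$1367.46 in sales. This figure comes from the 10-week dataset that appears commented out in the code below; with the 6-value dataset that is active in the code, the same formula gives about \\$84.40.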
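The arithmetic behind that answer can be checked with a small sketch in standard C++. The names `Line`, `fit_line`, and `forecast` are hypothetical helpers introduced here for illustration; the expressions for `a` and `b` follow the intercept and slope lines in the notebook code, but everything is computed in `double` and without DPC++.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: intercept a and slope b of y = a + b*x.
struct Line {
    double a;
    double b;
};

// Least-squares fit using the same sums as the notebook code.
Line fit_line(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const double n = static_cast<double>(x.size());
    double sum_x = 0, sum_y = 0, sum_xy = 0, sum_x_squared = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum_x += x[i];
        sum_y += y[i];
        sum_xy += x[i] * y[i];
        sum_x_squared += x[i] * x[i];
    }
    const double denom = n * sum_x_squared - sum_x * sum_x;
    assert(denom != 0.0);  // all x equal would make the slope undefined
    const double a = (sum_y * sum_x_squared - sum_x * sum_xy) / denom;
    const double b = (n * sum_xy - sum_x * sum_y) / denom;
    return {a, b};
}

// Predicted sales for a given advertising spend.
double forecast(const Line& line, double spend) {
    return line.a + line.b * spend;
}
```

Calling `forecast(fit_line(x, y), 50)` with the 10-week dataset from the code comments reproduces the \\$1367.46 figure to two decimal places.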


### The regression formula:
![Image of formula](Assets/regression.bmp)
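
For readers who cannot view the image, the following is a transcription of the intercept and slope expressions as the code in this notebook computes them. It assumes the image shows the same standard least-squares formulas; $N$ is the number of data points, $x$ the advertising spend, and $y$ the sales.

$$
a = \frac{\left(\sum y\right)\left(\sum x^2\right) - \left(\sum x\right)\left(\sum xy\right)}{N \sum x^2 - \left(\sum x\right)^2},
\qquad
b = \frac{N \sum xy - \left(\sum x\right)\left(\sum y\right)}{N \sum x^2 - \left(\sum x\right)^2},
\qquad
\hat{y} = a + b\,x
$$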

### The Pearson correlation formula:
<img src="Assets/pearson.gif" width="300px">
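
As computed in the notebook code, the coefficient is $r = \dfrac{N \sum xy - \sum x \sum y}{\sqrt{\left(N \sum x^2 - (\sum x)^2\right)\left(N \sum y^2 - (\sum y)^2\right)}}$. The sketch below evaluates that expression in standard C++; `pearson_r` is a hypothetical helper name, not a function from the notebook, and the sketch assumes neither series is constant.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pearson's correlation coefficient from raw sums.
// Returns a value in [-1, 1]; requires both series to vary.
double pearson_r(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const double n = static_cast<double>(x.size());
    double sum_x = 0, sum_y = 0, sum_xy = 0;
    double sum_x_squared = 0, sum_y_squared = 0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum_x += x[i];
        sum_y += y[i];
        sum_xy += x[i] * y[i];
        sum_x_squared += x[i] * x[i];
        sum_y_squared += y[i] * y[i];
    }
    const double spread_x = n * sum_x_squared - sum_x * sum_x;
    const double spread_y = n * sum_y_squared - sum_y * sum_y;
    assert(spread_x > 0 && spread_y > 0);  // constant series have no defined r
    return (n * sum_xy - sum_x * sum_y) / std::sqrt(spread_x * spread_y);
}
```

For the 6-value dataset that is active in the notebook code, this evaluates to roughly 0.5298, a moderate positive correlation.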





### To Save the Code to the lab folder
Select the grey cell below and click Run ▶ to save the code to lab/regression.cpp; the cell after it compiles and runs it:

In [1]:
%%writefile lab/regression.cpp

//==============================================================
// Copyright © 2021 Intel Corporation
// Author:Prilvesh Krishna
// Email:prilcool@hotmail.com    
// Linkedin:https://www.linkedin.com/in/prilvesh-k-4349ba54/
// Date:03/02/2020        
// SPDX-License-Identifier: MIT
// =============================================================



#include <CL/sycl.hpp>
#include <array>
#include <iostream>
#include <fstream>   // std::ofstream, used to write regression.txt
#include <cmath>
#include <math.h>
#include <iomanip>
#include <limits>
#include <chrono>



using namespace sycl;

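// We define a simple integer-power helper for use inside the GPU kernels
// instead of relying on the standard library pow.
// It assumes power_unit is a non-negative whole number; pow(x, 0) returns 1.
double pow(double base_unit, double power_unit) {
    double result = 1.0;
    for (int i = 0; i < power_unit; i++) {
        result = result * base_unit;
    }
    return result;
}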



int main() {
    auto start_time = std::chrono::high_resolution_clock::now();

    // N specifies the number of values in your dataset.

    constexpr int N=6;

    // If you select a GPU device, e.g. Device: Intel(R) Graphics Gen9 [0x3e96], processing takes about 1 second.
    queue q(gpu_selector{});

    // If you select a CPU device, e.g. Device: Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz, processing takes about 3 seconds.
    // To use the CPU, uncomment the line below and comment out the line above.
    //queue q(cpu_selector{});
    
    std::cout << "Device: " << q.get_device().get_info<info::device::name>() << std::endl;


    // Load the x and y datasets into two arrays.
    // You could load the data from a file, but for the purposes of this demo we place it inline
    // so that you can experiment in the Jupyter notebook with your own x and y values.

    // To use the alternate dataset, uncomment the two lines below, comment out the active x and y,
    // and set N above to the number of values (10 here).
    //int x[]={41,54,63,54,48,46,62,61,64,71};  // the amounts you spend on advertising each week
    //int y[]={1250,1380,1425,1425,1450,1300,1400,1510,1575,1650}; // the sales you get each week

    int x[]={43,21,25,42,57,59}; // the amounts you spend on advertising each week
    int y[]={99,65,79,75,87,81}; // the sales you get each week

    int sample_forecast=50; // you want to spend $50 next week and forecast how much sales you will make

    // declaring your variables
    
    int sum_x=0;  //sum of x values  
    int sum_y=0;  //sum of y values  
    int sum_xy=0; //sum of xy values  
    int sum_x_squared=0; //sum of x  squared values  
    int sum_y_squared=0; //sum of y  squared values  
    
    int*xy=malloc_shared<int>(N, q); //to hold xy calculated values  
    int*x_squared=malloc_shared<int>(N, q); // to hold x_squared calculated values  
    int*y_squared=malloc_shared<int>(N, q); // to hold y_squared calculated values

    
    // Display status text on screen
    std::cout << "Parallel data processing initialized" << std::endl;
   
    
    //we do calculation of  x*y in parallel
    
    q.parallel_for(range<1>(N), [=](id<1> i) {
       xy[i]=x[i]*y[i];
    }).wait();

    //we do calculation of  x_squared in parallel
    
    q.parallel_for(range<1>(N), [=](id<1> i) {
       x_squared[i]=pow(x[i],2);
    }).wait();

     //we do calculation of  y_squared  in parallel
    
    q.parallel_for(range<1>(N), [=](id<1> i) {
       y_squared[i]=pow(y[i],2);
    }).wait();
    
    //Next we calculate the  sum_x ,sum_y , sum_x_squared ,sum_y_squared  sum_xy

    for (int i =0; i < N; i++) {
     sum_x+=x[i];
     sum_y+=y[i];   
     sum_x_squared+=x_squared[i];
     sum_y_squared+=y_squared[i]; 
     sum_xy+=xy[i];   
    }
    
    
    //Next we calculate the  Intercept coefficient
    
     double a=((sum_y * sum_x_squared)-(sum_x * sum_xy)) / (N * (sum_x_squared)-pow(sum_x,2)); 

    //Next we calculate the  Slope coefficient 
    
     double b=(N*(sum_xy)-(sum_x * sum_y))/(N*(sum_x_squared)-pow(sum_x,2));
    
    
    
    //Now we can run a sample forecast using the linear regression formula y=a+b*(x)
    
     double Sales_regression_function=a+(b*(sample_forecast));
    
    
    //We can now also compute Pearson's correlation coefficient.
    //The range of the correlation coefficient is from -1 to 1.
    //For the 6-value dataset above the result is about 0.5298, which indicates a moderate positive correlation.
    
   double pearson_r=(N*(sum_xy)-(sum_x*sum_y))/sqrt((N*(sum_x_squared)-pow(sum_x,2))* (N*(sum_y_squared)-pow(sum_y,2)));

    
    
    std::cout << "Proceeding to output results to txt file" << std::endl;
    std::cout<<" "<<std::endl;
   
    // We write the forecast and working data to regression.txt, which is convenient when you have a lot of data and want to keep the results.
    // The file is created if it does not exist; otherwise the new output is appended.
    
     std::ofstream out("regression.txt", std::ios::app);
     
     out<<" "<<std::endl;
     out<<"The Pearsons correlation is "<<pearson_r<< std::endl;
    
     out << "Using the formula y=a+(b*"<<sample_forecast<<")"<<" We forecast that spending $"<<sample_forecast<<" on advertising can result in $"
     <<Sales_regression_function<<" in sales"<< std::endl;
 
    out<<" "<<std::endl;
    
    out<<"Below is the working data in case you want to view the values that were calculated in parallel"<<std::endl;
    out<<" "<<std::endl;
    
     out<<"Number of dataset values N is "<<N<<std::endl;
     out<<"Sum of X values "<<sum_x<<std::endl;
     out<<"Sum of Y values "<<sum_y<<std::endl; 
     out<<"Sum of X squared values "<<sum_x_squared<<std::endl;
     out<<"Sum of Y squared values "<<sum_y_squared<<std::endl; 
     out<<"Sum of XY values "<<sum_xy<<std::endl;
    
     out<<" "<<std::endl;
    
    // Write one CSV row per data point; all values are integers, so no float formatting is needed.
    out<<"X Value,"<<"Y Value,"<<"XY,"<<"X Squared,"<<"Y Squared"<<std::endl;
    for (int i = 0; i < N; i++) {
        out<< x[i] <<","<< y[i] <<","<< xy[i] << "," << x_squared[i]<< ","  << y_squared[i]<< std::endl;
    }
    
  
    std::cout<<" "<<std::endl;
    std::cout << "Processing complete. You can now open regression.txt in the file browser on the left-hand side of the Jupyter notebook." << std::endl;
     
    auto current_time = std::chrono::high_resolution_clock::now();

     std::cout << "The Processing was completed in  " << std::chrono::duration_cast<std::chrono::seconds>(current_time - start_time).count() << " seconds" << std::endl;

    // Release the shared memory allocated with malloc_shared; if you don't release it, you will leak memory.
    
    free(xy,q);
    free(x_squared,q);
    free(y_squared,q);
    return 0;
}









Overwriting lab/regression.cpp


In [2]:
! chmod 755 q; chmod 755 run_regression.sh;if [ -x "$(command -v qsub)" ]; then ./q run_regression.sh; else ./run_regression.sh; fi 

Job has been submitted to Intel(R) DevCloud and will execute soon.

 If you do not see result in 60 seconds, please restart the Jupyter kernel:
 Kernel -> 'Restart Kernel and Clear All Outputs...' and then try again

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
798469.v-qsvr-1            ...ub-singleuser u60146          00:00:45 R jupyterhub     
798480.v-qsvr-1            ...regression.sh u60146                 0 Q batch          

Waiting for Output █████████████████████ Done⬇

########################################################################
#      Date:           Sun 21 Feb 2021 12:08:35 AM PST
#    Job ID:           798480.v-qsvr-1.aidevcloud
#      User:           u60146
# Resources:           neednodes=1:gpu:ppn=2:gen9,nodes=1:gpu:ppn=2:gen9,walltime=06:00:00
########################################################################

## u60146 is compiling your interes

_If the Jupyter cells are not responsive or if they error out when you compile the code samples, please restart the Jupyter Kernel: 
"Kernel->Restart Kernel and Clear All Outputs" and compile the code samples again_