# Gluon Dataset: Introduction to non-linear Regression using TensorFlow

## Source

https://github.com/rabah-khalek/TF_tutorials

## Learning Goals
This notebook serves as an introduction to non-linear regression and to TensorFlow, Google's library for Machine Learning (ML). We will also learn how to use the versatile Pandas package for handling data.


## Overview
Throughout, we will work with the [Gluon dataset](https://github.com/rabah-khalek/TF_tutorials/tree/master/PseudoData). It was computed using [LHAPDF](https://lhapdf.hepforge.org), an open-source, general-purpose C++ and Python interpolator used for evaluating parton distribution functions (PDFs) from discretised data files.

Here is the description of the Gluon dataset we will be playing around with for this notebook:
>A gluon is an elementary particle that acts as the exchange particle (or gauge boson) for the strong force between quarks. It is analogous to the exchange of photons in the electromagnetic force between two charged particles. 

>In technical terms, gluons are vector gauge bosons that mediate strong interactions of quarks in quantum chromodynamics (QCD). Gluons themselves carry the color charge of the strong interaction. This is unlike the photon, which mediates the electromagnetic interaction but lacks an electric charge. Gluons therefore participate in the strong interaction in addition to mediating it, making QCD significantly harder to analyze than QED (quantum electrodynamics).

>Because partons (quarks and gluons), which cannot be observed as free particles, are inherently non-perturbative, parton densities cannot be calculated using perturbative QCD. Parton distribution functions are instead obtained by fitting observables to experimental data.

> The parton density function $f_i(x,Q)$ gives the probability of finding in the proton a parton of flavour $i$ (a quark or a gluon) carrying a fraction $x$ of the proton momentum, with $Q$ being the energy scale of the hard interaction. Cross sections are calculated by convoluting the parton-level cross section with the PDFs. Since QCD does not predict the parton content of the proton, the shapes of the PDFs are determined by a fit to data from experimental observables in various processes, using the DGLAP evolution equations.

> This pseudo-data is computed from such a fit, performed by the [NNPDF collaboration](http://nnpdf.mi.infn.it), which determines the structure of the proton using contemporary methods of artificial intelligence. NNPDF uses neural networks as an unbiased modelling tool, trained with genetic algorithms and, more recently, stochastic gradient descent, to construct a Monte Carlo representation of PDFs and their uncertainties: a probability distribution in a space of functions.
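
The convolution mentioned in the description can be written out explicitly. As a sketch, for a proton-proton collision the standard collinear factorisation formula reads as follows. The symbols $\hat{\sigma}_{ij}$ (parton-level cross section) and $x_1, x_2$ (momentum fractions of the two incoming partons) are notation introduced here for illustration and do not appear in the original notebook:

$$
\sigma = \sum_{i,j} \int_0^1 dx_1 \int_0^1 dx_2 \; f_i(x_1, Q)\, f_j(x_2, Q)\, \hat{\sigma}_{ij}(x_1, x_2, Q)
$$

In this notebook only the gluon flavour at a single fixed scale $Q$ is considered, so the quantity to be fitted is a function of $x$ alone.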


## Importing the Gluon data set with Pandas

The dataset consists of 1000 gluon PDF predictions computed for $x \in [10^{-6}, 1]$ at $Q = 2\,\mathrm{GeV}$.  
<b>Exercise:</b> In what follows, use Pandas to import the data, select 800 random x-points as the training data, and keep the remaining 200 x-points as the test data.


In [3]:
# Importing the Gluon Data set
import sys, os
import pandas as pd
from sklearn.model_selection import train_test_split

import numpy as np
import warnings
# Uncomment the next line to turn off warnings
#warnings.filterwarnings('ignore')


seed=12
np.random.seed(seed)
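# Suppress TensorFlow C++ log messages; set this before importing tensorflow so it takes effect
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
import tensorflow as tf

# TF 1.x API; in TF 2.x the equivalent call is tf.random.set_seed(seed)
tf.set_random_seed(seed)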

# Path to the gluon pseudo-data file, relative to this notebook
filename='PseudoData/gluon_NNPDF31_nlo_pch_as_0118.dat'

lines_to_skip = 5

columns=["x", "gluon_cv", "gluon_sd"]
# Read the whitespace-separated file, skipping the first lines_to_skip lines
df = pd.read_csv(filename, 
                 sep=r"\s+", 
                 skiprows=lines_to_skip, 
                 usecols=[0,1,2], 
                 names=columns)

# Random 80/20 split: 800 rows as train data, 200 as test data
df_train, df_test = train_test_split(df, test_size=0.2)

df_train = df_train.sort_values("x")
df_test = df_test.sort_values("x")

print("Data parsing is done!")


Data parsing is done!
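
To see what the parsing options and the split do without needing the real data file, the sketch below builds a small in-memory stand-in and runs the same `read_csv` call on it. Everything about the stand-in is an assumption made for illustration: the five `#` header lines, the log-spaced grid in $x$, and the random numbers used in place of real PDF values. The split uses `DataFrame.sample` and `DataFrame.drop` as a Pandas-only alternative to `train_test_split`; it is not the method used in the cell above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data file: 5 header lines followed by
# whitespace-separated columns (x, central value, standard deviation).
# The numbers are random placeholders, NOT real PDF values.
rng = np.random.default_rng(12)
x_grid = np.logspace(-6, 0, 1000)                # assumed log-spaced grid in x
fake_cv = rng.normal(size=x_grid.size)           # placeholder central values
fake_sd = np.abs(rng.normal(size=x_grid.size))   # placeholder uncertainties

header = "".join(f"# header line {i}\n" for i in range(5))
body = "\n".join(
    f"{x:.8e}   {cv:.8e}   {sd:.8e}"
    for x, cv, sd in zip(x_grid, fake_cv, fake_sd)
)
fake_file = io.StringIO(header + body + "\n")

# Same parsing options as in the notebook cell: skip the header lines,
# split on runs of whitespace, and name the three columns explicitly.
df_demo = pd.read_csv(
    fake_file,
    sep=r"\s+",
    skiprows=5,
    usecols=[0, 1, 2],
    names=["x", "gluon_cv", "gluon_sd"],
)

# Pandas-only 80/20 split: sample 80% of the rows at random,
# then keep every row that was not sampled as the test set.
demo_train = df_demo.sample(frac=0.8, random_state=12)
demo_test = df_demo.drop(demo_train.index)

# Sanity checks: the sizes add up and no row appears in both subsets.
n_train, n_test = len(demo_train), len(demo_test)
n_overlap = len(demo_train.index.intersection(demo_test.index))
print(n_train, n_test, n_overlap)
```

The same three checks (training size, test size, and zero overlap) can be applied to `df_train` and `df_test` from the cell above to confirm that the exercise's 800/200 split was obtained.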
