# **Data Sets**

Real-world data sets are usually much bigger than the small examples used in tutorials, but real-world data can be difficult to gather, at least at an early stage of a project.

## **Big Data Sets**
To create big data sets for testing, we use the Python module NumPy, which comes with a number of methods for creating random data sets of any size.

In [None]:
# Creating 250 random floats between 0 and 5
import numpy

x = numpy.random.uniform(0.0, 5.0, 250)

print(x)
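
Because the values are random, every run prints a different array. As a minimal sketch, assuming a NumPy version that provides `numpy.random.default_rng`, a seeded generator makes the data reproducible, and a few quick checks confirm the size and range of the result. The seed value 42 is an arbitrary choice for illustration, not something the example above requires.

```python
import numpy

# Seeded generator: the seed 42 is an arbitrary, illustrative choice.
rng = numpy.random.default_rng(42)

# 250 floats drawn uniformly from the interval between 0.0 and 5.0
x = rng.uniform(0.0, 5.0, 250)

print(x.size)            # number of values generated
print(x.min(), x.max())  # smallest and largest value in the array
```

Running the cell again with the same seed produces the same 250 values, which is convenient when a test needs to be repeated.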

## **Histogram**
When working with large amounts of data, a histogram is often used to visualize how the values in a data set are distributed.

In the example below:

- We use the array from the example above to draw a histogram with 5 bars.
- The first bar represents how many values in the array are between approximately 0 and 1.
- The second bar represents how many values are between approximately 1 and 2.

In [None]:
# Using matplotlib's pyplot to visualize the distribution of the data set.

import numpy
import matplotlib.pyplot as plt

x = numpy.random.uniform(0.0, 5.0, 250)

plt.hist(x, 5)
plt.show()
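
To see the numbers behind the bars, the counts can be computed directly with `numpy.histogram`. This is a small sketch rather than part of the plotting example: the `range=(0.0, 5.0)` argument and the seed are assumptions added here so that the bin edges fall exactly on 0, 1, 2, 3, 4 and 5. Without `range`, the edges run from the smallest to the largest value in the array.

```python
import numpy

# Seeded generator so the counts are repeatable (the seed is arbitrary).
rng = numpy.random.default_rng(1)
x = rng.uniform(0.0, 5.0, 250)

# Count how many values fall into each of 5 equal-width bins on [0, 5].
counts, edges = numpy.histogram(x, bins=5, range=(0.0, 5.0))

print(edges)   # the 6 bin edges that delimit the 5 bars
print(counts)  # the height of each bar

# Every value lands in exactly one bin, so the counts add up to 250.
print(counts.sum())
```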

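## **Normal Data Distribution (Gaussian Data Distribution)**
If the values in an array are concentrated symmetrically around a single value (the mean) and become rarer the further they are from it, producing a bell-shaped histogram, the array is said to follow a normal data distribution.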

In the example below:

We use the array from the numpy.random.normal() method, with 100000 values, to draw a histogram with 100 bars.

We specify that the mean value is 5.0, and the standard deviation is 1.0.
- This means that the values should be concentrated around 5.0: about 68% of them fall within 1.0 of the mean, and about 95% within 2.0.
- As you can see from the histogram, most values are between 4.0 and 6.0, with a peak at approximately 5.0.

In [None]:
# Creating normal data distribution
import numpy
import matplotlib.pyplot as plt

x = numpy.random.normal(5.0, 1.0, 100000)

plt.hist(x, 100)
plt.show()
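
To check how tightly the values cluster around the mean, one can count the share of values that lie within one and within two standard deviations of it. For a normal distribution these shares are about 68% and 95%. The sketch below assumes a seeded generator (the seed is arbitrary) so that the result is repeatable.

```python
import numpy

mean = 5.0
std_dev = 1.0

# Seeded generator so the fractions are repeatable (the seed is arbitrary).
rng = numpy.random.default_rng(7)
x = rng.normal(mean, std_dev, 100000)

# Fraction of values no further than 1 and 2 standard deviations from the mean
within_one = numpy.mean(numpy.abs(x - mean) <= 1 * std_dev)
within_two = numpy.mean(numpy.abs(x - mean) <= 2 * std_dev)

print(within_one)  # close to 0.68
print(within_two)  # close to 0.95
```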

## **Scatter Plot**
A scatter plot is a type of data visualization that shows the relationship between two variables by displaying points on a two-dimensional coordinate system.

The scatter() method from matplotlib takes two arrays of the same length, one for the values on the x-axis and one for the values on the y-axis.

In [None]:
# Creating scatter plot for data visualization
import matplotlib.pyplot as plt

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

plt.scatter(x, y)
plt.show()

## **Random Data Distributions**
In Machine Learning, data sets can contain thousands, or even millions, of values.

Testing machine learning algorithms often requires huge data sets, and when real data is not available, the values can be generated randomly.

Let us create two arrays that are both filled with 1000 random numbers from a normal data distribution.

- The first array will have the mean set to 5.0 with a standard deviation of 1.0.
- The second array will have the mean set to 10.0 with a standard deviation of 2.0.

***Output***

- We can see that the dots are concentrated around the value 5 on the x-axis, and 10 on the y-axis.
- We can also see that the spread is wider on the y-axis than on the x-axis.

In [None]:
# Plotting 1000 randomly generated points.
import numpy
import matplotlib.pyplot as plt

x = numpy.random.normal(5.0, 1.0, 1000)
y = numpy.random.normal(10.0, 2.0, 1000)

plt.scatter(x, y)
plt.show()
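
The same observations can be checked numerically instead of visually. As a rough sketch, assuming a seeded generator (the seed is arbitrary), the sample mean and standard deviation of each array should land close to the values passed to normal():

```python
import numpy

# Seeded generator so the statistics are repeatable (the seed is arbitrary).
rng = numpy.random.default_rng(0)

x = rng.normal(5.0, 1.0, 1000)
y = rng.normal(10.0, 2.0, 1000)

# Sample mean and standard deviation of each array
print(x.mean(), x.std())  # near 5.0 and 1.0
print(y.mean(), y.std())  # near 10.0 and 2.0
```

With only 1000 values the results will not match the requested parameters exactly, but the larger standard deviation of y corresponds to the wider spread on the y-axis.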

## **Regression**

Regression is used when you try to find the relationship between variables.

In Machine Learning, and in statistical modeling, that relationship is used to predict the outcome of future events.

### **Linear Regression**
Linear regression uses the relationship between the data points to fit a straight line that passes as close as possible to all of them.

This line can be used to predict future values.

***Example***

In the example below, the x-axis represents the age of a car, and the y-axis represents its speed. We have registered the age and speed of 13 cars as they were passing a tollbooth.

In [None]:
import matplotlib.pyplot as plt
from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

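# Fit a straight line to the data; r is the correlation coefficient
slope, intercept, r, p, std_err = stats.linregress(x, y)

# Return the y value on the fitted line for a given x value
def myfunc(x):
    return slope * x + intercept

# Run each x value through the function to get the fitted y values
mymodel = list(map(myfunc, x))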

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
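
Once the line is fitted, it can be used to predict a value that was not in the data. The sketch below reuses the same 13 data points and asks for the speed of a 10-year-old car. The age of 10 and the helper name `predict_speed` are hypothetical choices made for this illustration. It also prints r, which indicates how well the line fits the data (values near -1 or 1 mean a strong linear relationship, values near 0 a weak one).

```python
from scipy import stats

# Age (x) and speed (y) of the 13 cars from the example above
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

# Hypothetical helper: predicted speed for a car of the given age
def predict_speed(age):
    return slope * age + intercept

print(r)                  # roughly -0.76: older cars tend to be slower
print(predict_speed(10))  # roughly 85.6
```

A prediction like this is only as reliable as the fit. With r around -0.76 the relationship is fairly strong but far from perfect, so the predicted speed should be read as an estimate.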