# **Data Sets**

Real-world data sets are often very large, but it can be difficult to gather real-world data, at least at an early stage of a project.

## **Big Data Sets**
To create big data sets for testing, we use the Python module NumPy, which provides a number of methods for creating random data sets of any size.

In [None]:
# Creating 250 random floats between 0 and 5
import numpy

x = numpy.random.uniform(0.0, 5.0, 250)

print(x)
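
Because the values are random, every run of the cell above prints different numbers, which can make tests hard to repeat. The sketch below assumes that a fixed seed is acceptable for testing. The seed value 42 and the name rng are arbitrary choices for illustration and are not part of the original example. It uses NumPy's default_rng generator, whose uniform method takes the same low, high, and size arguments.

```python
import numpy

# Assumption: a fixed seed (42 is arbitrary) so that repeated runs give the same data.
rng = numpy.random.default_rng(42)

# 250 random floats drawn uniformly from the half-open interval [0.0, 5.0)
x = rng.uniform(0.0, 5.0, 250)

print(x.size)            # number of values generated
print(x.min(), x.max())  # both lie within [0.0, 5.0)
```

Creating a second generator with the same seed reproduces exactly the same array, which is convenient when a test needs stable input.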

## **Histogram**
When working with large amounts of data, a histogram is often used to visualize how the values are distributed.

In the example below:

- We use the array from the example above to draw a histogram with 5 bars.
- The first bar represents how many values in the array are between 0 and 1.
- The second bar represents how many values are between 1 and 2.

In [None]:
# Using matplotlib's pyplot to visualize the distribution of the data set.

import numpy
import matplotlib.pyplot as plt

x = numpy.random.uniform(0.0, 5.0, 250)

plt.hist(x, 5)
plt.show()
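
The bar heights can also be computed without drawing anything. The following is a minimal sketch using numpy.histogram. The explicit range=(0.0, 5.0) argument is an assumption added here so that the bin edges land exactly on 0, 1, 2, 3, 4, and 5. Without it, as with plt.hist(x, 5), the bins span from the smallest to the largest value in the data, so the edges are only approximately whole numbers.

```python
import numpy

x = numpy.random.uniform(0.0, 5.0, 250)

# Count how many values fall into each of 5 equal-width bins over [0, 5].
counts, edges = numpy.histogram(x, bins=5, range=(0.0, 5.0))

for count, low, high in zip(counts, edges[:-1], edges[1:]):
    print(f"{low:.0f} to {high:.0f}: {count} values")
```

For uniformly distributed data, each of the five counts should be somewhere near 50, since 250 values are spread evenly over five bins.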

## **Normal Data Distribution (Gaussian Data Distribution)**
If the values in an array are concentrated around a single central value and become less frequent the further they are from it, the data is said to follow a normal data distribution.

In the example below:

We use the array from the numpy.random.normal() method, with 100000 values, to draw a histogram with 100 bars.

We specify that the mean value is 5.0, and the standard deviation is 1.0.
- This means that the values should be concentrated around 5.0, with about 68% of them falling within 1.0 of the mean.
- As you can see from the histogram, most values are between 4.0 and 6.0, with a peak at approximately 5.0.

In [None]:
# Creating normal data distribution
import numpy
import matplotlib.pyplot as plt

x = numpy.random.normal(5.0, 1.0, 100000)

plt.hist(x, 100)
plt.show()
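
The concentration around the mean can also be checked numerically. For a normal distribution, roughly 68% of the values lie within one standard deviation of the mean and roughly 95% within two. The sketch below is illustrative only: the seeded generator and the names within_one and within_two are assumptions introduced here, not part of the original example.

```python
import numpy

# Assumption: a seeded generator (seed 0 is arbitrary) so the result is repeatable.
rng = numpy.random.default_rng(0)
x = rng.normal(5.0, 1.0, 100000)

# Fraction of values within one standard deviation (1.0) of the mean (5.0)
within_one = numpy.mean(numpy.abs(x - 5.0) <= 1.0)
# Fraction of values within two standard deviations (2.0) of the mean
within_two = numpy.mean(numpy.abs(x - 5.0) <= 2.0)

print(within_one)  # close to 0.68
print(within_two)  # close to 0.95
```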

## **Scatter Plot**
A scatter plot is a type of data visualization that shows the relationship between two variables by displaying points on a two-dimensional coordinate system.

The scatter() method from matplotlib takes two arrays of the same length: one for the x-axis values and one for the y-axis values.

In [None]:
# Creating scatter plot for data visualization
import matplotlib.pyplot as plt

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

plt.scatter(x, y)
plt.show()

## **Random Data Distributions**
In machine learning, data sets can contain thousands, or even millions, of values.

Testing machine learning algorithms often requires large data sets, and when real data is not available the values can be generated randomly.

Let us create two arrays that are both filled with 1000 random numbers from a normal data distribution.

- The first array will have the mean set to 5.0 with a standard deviation of 1.0.
- The second array will have the mean set to 10.0 with a standard deviation of 2.0.

***Output***

- We can see that the dots are concentrated around the value 5 on the x-axis, and 10 on the y-axis.
- We can also see that the spread is wider on the y-axis than on the x-axis.

In [None]:
# Plotting 1000 randomly generated points.
import numpy
import matplotlib.pyplot as plt

x = numpy.random.normal(5.0, 1.0, 1000)
y = numpy.random.normal(10.0, 2.0, 1000)

plt.scatter(x, y)
plt.show()
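
What the scatter plot shows visually can be confirmed with a few summary statistics. This is a minimal sketch, assuming a seeded generator (the seed 1 is arbitrary) so that the printed numbers are repeatable. It prints the sample mean and standard deviation of each array, plus the correlation between them.

```python
import numpy

# Assumption: a seeded generator (seed 1 is arbitrary) for repeatable output.
rng = numpy.random.default_rng(1)
x = rng.normal(5.0, 1.0, 1000)
y = rng.normal(10.0, 2.0, 1000)

# Sample mean and standard deviation of each array
print(x.mean(), x.std())  # near 5.0 and 1.0
print(y.mean(), y.std())  # near 10.0 and 2.0

# x and y are generated independently, so their correlation should be near 0
corr = numpy.corrcoef(x, y)[0, 1]
print(corr)
```

The larger standard deviation of y is what makes the cloud of dots taller than it is wide.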

## **Regression**

Regression is used when you try to find the relationship between variables.

In Machine Learning, and in statistical modeling, that relationship is used to predict the outcome of future events.

### **Linear Regression**
Linear regression uses the relationship between the data points to fit a straight line that passes as close as possible to all of them.

This line can be used to predict future values.

In the example below, the x-axis represents age, and the y-axis represents speed. We have registered the age and speed of 13 cars as they passed a tollbooth.

In [None]:
import matplotlib.pyplot as plt
from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

slope, intercept, r, p, std_err = stats.linregress(x, y)
# r is about -0.76: the relationship is not perfect, but it indicates that we could use linear regression in future predictions.
print(r)

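# Use the slope and intercept to compute the y-value for a given x
def myfunc(x):
  return slope * x + intercept

# Run each x value through the function to get the points on the regression line
mymodel = list(map(myfunc, x))
# Predict the speed of a 9-year-old car
print(myfunc(9))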

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
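
To make it clearer where the slope and intercept come from, the least-squares formulas for a straight line can be written out by hand. The sketch below uses plain Python and the same car data. The names s_xy and s_xx are introduced here for illustration only. The slope is the sum of the products of the deviations from the means divided by the sum of the squared deviations of x, and the intercept follows from the fact that the line passes through the point (mean of x, mean of y).

```python
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sum of products of deviations, and sum of squared deviations of x
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)

slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

print(slope, intercept)       # roughly -1.75 and 103.11
print(slope * 9 + intercept)  # predicted speed for a 9-year-old car, roughly 87.3
```

These values should agree with the slope and intercept returned by stats.linregress(x, y) for the same data.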

## **Correlation Coefficient (R)**

The correlation coefficient R is an output of linear regression, not a value we choose beforehand. It tells us how strongly the two variables are linearly related, while R² tells us how well the model fits the data. R ranges from -1 to 1 (and R² from 0 to 1), where:

- R = 1 indicates perfect positive correlation
- R = -1 indicates perfect negative correlation
- R = 0 indicates no linear correlation

For making predictions, what you want to look for is:

  - A strong R value (closer to -1 or 1)
  - Statistical significance of the relationship
  - Whether the linear relationship makes logical sense for your data

For many real-world predictions, ***|R| > 0.7*** is often considered strong, though the acceptable threshold depends on your specific field and use case.
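
As a rough illustration of how R is computed, the sketch below implements the Pearson correlation coefficient in plain Python and applies it to the car data from the linear regression example. The function name pearson_r and the variable names ages and speeds are hypothetical, chosen here for readability. In simple linear regression with one x variable, squaring R gives R².

```python
import math

def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products and squares of the deviations from the means
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

ages = [5,7,8,7,2,17,2,9,4,11,12,9,6]
speeds = [99,86,87,88,111,86,103,87,94,78,77,85,86]

r = pearson_r(ages, speeds)
print(r)       # roughly -0.76
print(r ** 2)  # roughly 0.58
```

The negative sign shows that speed tends to fall as age rises, and the magnitude of about 0.76 is just above the 0.7 guideline mentioned above.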

***Key differences between a good fit and a bad fit when using linear regression:***
  1.  Good Fit (R ≈ 0.99):

    - The points follow a clear upward trend
    - Points lie very close to the regression line
    - There's a strong linear relationship
    - This would be reliable for making predictions


  2.  Bad Fit (R ≈ 0.1):

    - Points are scattered randomly
    - No clear pattern or trend
    - The regression line doesn't represent the data well
    - This would be unreliable for making predictions

In [None]:
# Comparing a bad fit (random data with low correlation) with a good fit, side by side
import matplotlib.pyplot as plt
from scipy import stats

# Bad fit example
x1 = [89, 43, 36, 36, 95, 10, 66, 34, 38, 20, 26, 29, 48, 64, 6, 5, 36, 66, 72, 40]
y1 = [21, 46, 3, 35, 67, 95, 53, 72, 58, 10, 26, 34, 90, 33, 38, 20, 56, 2, 47, 15]

slope1, intercept1, r1, p1, std_err1 = stats.linregress(x1, y1)

def myfunc1(x):
    return slope1 * x + intercept1

mymodel1 = list(map(myfunc1, x1))

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(x1, y1)
plt.plot(x1, mymodel1)
plt.title(f'Bad Fit (R = {r1:.2f})')

# Good fit example
x2 = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
y2 = [15, 23, 29, 34, 40, 45, 55, 60, 65, 70]

slope2, intercept2, r2, p2, std_err2 = stats.linregress(x2, y2)

def myfunc2(x):
    return slope2 * x + intercept2

mymodel2 = list(map(myfunc2, x2))

plt.subplot(1, 2, 2)
plt.scatter(x2, y2)
plt.plot(x2, mymodel2)
plt.title(f'Good Fit (R = {r2:.2f})')

plt.tight_layout()
plt.show()

## **Polynomial Regression**

Polynomial regression, like linear regression, uses the relationship between the variables x and y to find the best way to draw a line through the data points. Here the line is a curve described by a polynomial rather than a straight line.

It can be a better choice when your data points clearly do not fit a straight line.

In [None]:
# Fitting a polynomial regression to data that a straight line does not fit well
import numpy
import matplotlib.pyplot as plt

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

myline = numpy.linspace(1, 22, 100)

plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()

### **R-Squared**
R-squared measures the proportion of variance explained by the model.

It is important to know how strong the relationship between the x- and y-values is. If there is no relationship, polynomial regression cannot be used to predict anything.

*The r-squared value ranges from 0 to 1, where 0 means no relationship, and 1 means 100% related.*

However,

  - Higher R-squared doesn't always mean better model
  - Polynomial regression can overfit, especially with higher degrees
  - Need to balance between R-squared and model complexity

In [None]:
import numpy
from sklearn.metrics import r2_score

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

print(r2_score(y, mymodel(x)))

# The result 0.94 shows that there is a very good relationship,
# and we can use polynomial regression in future predictions.
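
The value printed by r2_score can be reproduced from the definition of r-squared: 1 minus the ratio of the unexplained variation (the sum of squared residuals) to the total variation (the sum of squared deviations from the mean of y). The sketch below is illustrative only. The helper r_squared and the extra degrees 1 and 5 are assumptions added here to show how the value changes with model complexity.

```python
import numpy

x = numpy.array([1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22])
y = numpy.array([100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100])

def r_squared(degree):
    """R-squared of a least-squares polynomial fit of the given degree."""
    model = numpy.poly1d(numpy.polyfit(x, y, degree))
    ss_res = numpy.sum((y - model(x)) ** 2)   # unexplained variation
    ss_tot = numpy.sum((y - y.mean()) ** 2)   # total variation
    return 1 - ss_res / ss_tot

for degree in (1, 3, 5):
    print(degree, r_squared(degree))
```

On the data used for fitting, r-squared does not decrease as the degree grows, which is why a high value on its own does not prove that a more complex model is better.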

### **Prediction Using Models**

Since the r-squared value is close to 1, the polynomial model fits the data well, and we can use it to predict y-values for new x-values.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

x = [1,2,3,5,6,7,8,9,10,12,13,14,15,16,18,19,21,22]
y = [100,90,80,60,60,55,60,65,70,70,75,76,78,79,90,99,99,100]

mymodel = np.poly1d(np.polyfit(x, y, 3))

my_new_value = mymodel(17)
print(my_new_value)

myline = np.linspace(1, 22, 100)

plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
# Add just one point to highlight the prediction
plt.scatter(17, my_new_value, color='red', s=100)  # red point, slightly larger size
plt.show()

### **Bad Fit (For Polynomial Regression)**
If you get a very low r-squared value (close to 0), the model is a bad fit.

Refer to the example below.

In [None]:
import numpy
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score

x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]

mymodel = numpy.poly1d(numpy.polyfit(x, y, 3))

print(r2_score(y, mymodel(x))) # The result is very low, i.e. close to zero.

myline = numpy.linspace(5, 95, 100)

plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
# The plotted line does not follow the scattered points in any meaningful way.
# Thus, by evaluating the r-squared value, we can determine whether polynomial regression is a good or bad fit for the given dataset.