# Scaling

**Note:** You will need to look up many new functions in the NumPy documentation to complete these tasks. These are the kind of open-ended questions that a data scientist needs to tackle, so this exercise will be a good learning experience.

## 1. Creating a dataset

Let's create a hypothetical dataset with data about athletes, their physical attributes, and their sports records. 

### 1.1 
Create an array to identify each athlete uniquely. The array should contain unique five-digit integers and have one thousand elements, that is, shape (1000,).

In [80]:
# importing necessary libraries 
import numpy as np
# Setting a seed so that the random commands generate the same numbers for everyone.
np.random.seed(0) 
# Suppressing the scientific notation so you can read numbers in the output. 
np.set_printoptions(suppress=True)
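
The call used to build the ID array is not shown here, so the following is only a minimal sketch of one possible approach. It assumes `np.random.randint` is an acceptable generator and uses the name `athlete_id` from the next task.

```python
import numpy as np

np.random.seed(0)  # same seed as the cell above

# randint draws from [low, high), so 100000 is excluded and every
# value has exactly five digits. Draws are made with replacement,
# so repeated IDs are possible and must be checked for afterwards.
athlete_id = np.random.randint(10000, 100000, size=1000)

print(athlete_id.shape)  # (1000,)
```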

### 1.2 
Check if all the entries in the "athlete_id" column are unique.

**Hint:** Find the number of unique entries in the athlete_id array and compare that to the total number of elements in the array. 
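
The hint can be sketched on a small array. The values in `ids` are hypothetical stand-ins, not the real athlete IDs.

```python
import numpy as np

# Hypothetical stand-in for the athlete_id array; 10001 appears twice.
ids = np.array([10001, 20002, 10001, 30003])

# np.unique returns the distinct values, so comparing its size with
# the size of the original array reveals whether anything repeats.
n_unique = np.unique(ids).size
all_unique = n_unique == ids.size

print(n_unique, ids.size, all_unique)  # 3 4 False
```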

Seven values appear twice. You will have to remove the duplicates and add new unique numbers to bring the array back to 1000 elements.

### 1.3 
Find the duplicate values and remove them. 
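
One possible way to do this is sketched below on a small hypothetical array. It assumes that "removing" a duplicate means keeping one copy of each repeated value, and that the replacements may be drawn from any unused five-digit numbers. Note that `np.unique` returns its result sorted.

```python
import numpy as np

# Hypothetical stand-in; the real array has 1000 entries.
ids = np.array([10001, 20002, 10001, 30003, 20002, 40004])

# Distinct values and how often each one occurs.
values, counts = np.unique(ids, return_counts=True)
duplicates = values[counts > 1]   # values that occur more than once
deduped = values                  # one copy of every value, sorted

# Top up with five-digit numbers that are not already present.
missing = ids.size - deduped.size
pool = np.setdiff1d(np.arange(10000, 100000), deduped)
rng = np.random.default_rng(0)
extra = rng.choice(pool, size=missing, replace=False)

athlete_id = np.concatenate([deduped, extra])
print(duplicates)  # [10001 20002]
```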

Now we have an array of 1000 unique numbers in a column format. 

So in the dataset, we have the first column set.

### 1.4 
Create a column for the height of the athletes. 

The height column must follow a few rules. 

1. The height of an athlete is a naturally occurring variable, so a normal distribution is a good choice for generating the height data.
2. You also need to fix the maximum and minimum heights.
3. Height need not be unique. 
4. Height needs to be in whole centimeters (integers)

Assumptions - 
1. Shortest person's height - 160 cm 
2. Tallest person's height - 240 cm
3. Average person's height - 200 cm

The problem with a normal distribution is that it will always produce outliers. For instance, a plain normal sample can contain values like 138 or 272, which are undesirable here. In real life you will find people who are exceptionally tall or short, but in this case we only want to consider people with heights between 160 and 240 cm.

So you will have to replace those values with other values from the acceptable range. 

Now, the task is to ensure that all the randomly picked values are normally distributed and fall within the acceptable range.

Let's think of ways of meeting all the said conditions. 

1. Create a larger sample of normally distributed integers than needed.
2. Use a filter to extract all the numbers within the chosen range.
3. Finally, trim the filtered values down to the required number of observations.
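
The three steps above can be sketched as follows. This is only an illustration: the oversampling factor of 3, the standard deviation of 20, and the use of a boolean mask as the filter are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000

# Step 1: draw more values than needed and round them to whole centimeters.
raw = rng.normal(loc=200, scale=20, size=3 * N)
raw = np.rint(raw).astype(int)

# Step 2: keep only the values inside the accepted range (boolean mask).
in_range = raw[(raw >= 160) & (raw <= 240)]

# Step 3: trim to the required number of observations.
heights = in_range[:N]

print(heights.size, heights.min() >= 160, heights.max() <= 240)
```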

There is a flaw in this method. In the end, you might not end up with a normally distributed sample, so it might not work.

As an alternative, there is a more advanced library called scipy. It builds on top of NumPy and adds a lot more functionality. The scipy library has a function called **truncnorm**.

Reading material for the function:

1. [Official documentation for truncnorm](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.truncnorm.html)

2. [Explanation for the usage of truncnorm](https://stackoverflow.com/questions/18441779/how-to-specify-upper-and-lower-limits-when-using-numpy-random-normal)

Since the scipy library is not covered, the following code block will create the necessary observations. 

In [68]:
import scipy.stats

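# Bounds and mean of the height distribution, in centimeters
lower = 160
upper = 240
mu = 200
# Standard deviation of the underlying normal distribution
sigma = 20
# Number of observations to draw
N = 1000

# truncnorm expects its bounds in units of standard deviations from the mean
samples = scipy.stats.truncnorm.rvs(
    (lower - mu) / sigma, (upper - mu) / sigma, loc=mu, scale=sigma, size=N)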
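
Rule 4 above asks for whole centimeters, while `truncnorm.rvs` returns floating-point values. A minimal sketch of the conversion is shown below. The values in `samples` are hypothetical stand-ins for a few draws, and the name `height` is an assumption.

```python
import numpy as np

# Hypothetical stand-in for a few values returned by truncnorm.rvs.
samples = np.array([160.2, 187.6, 201.4, 239.7])

# Round to the nearest whole centimeter, then convert to integers.
# Because the bounds 160 and 240 are integers, rounding cannot push
# a value outside the accepted range.
height = np.rint(samples).astype(int)

print(height)  # [160 188 201 240]
```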

### 1.5 
Create an array for the length of the jump.

The length of an athlete's jump depends on many different factors, like the physical strength of the athlete, their training, their natural talent, and so on. So it is fair to draw the values randomly within a limit. The length is measured in meters, with a least count of 1 cm. Create the length of the jump parameter for the data.

The length of the average long jump is 6 to 7 meters. The world record is 8.95 m. Hence, let's assume the highest long jump that we recorded in our data set to be 8.00m and the lowest limit to be 5.00m. 

Generate random numbers between 5 and 8 with a uniform distribution.
Why a uniform distribution? Because the jump length depends on many factors that we are not modeling, we make the simplifying assumption that every value within the range is equally likely.
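
A minimal sketch of this step, assuming the name `long_jump` and a local random generator. Rounding to two decimals reflects the least count of 1 cm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform draws between 5.00 m and 8.00 m, rounded to the nearest centimeter.
long_jump = np.round(rng.uniform(5.0, 8.0, size=1000), 2)

print(long_jump.shape)  # (1000,)
```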

So now we have three arrays: a unique athlete ID array, a normally distributed height array, and a uniformly distributed jump array.

### 1.6 
Combine the three arrays to create a dataset. 

The form of the dataset should be

"Athlete_id", "Height", "Long_jump", with a total of 1000 readings of this form.

**Hint:** Take the sequence of 1-D arrays and stack them as columns to make a single 2-D array.
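
The hint matches what `np.column_stack` does. The sketch below uses three short hypothetical arrays in place of the real ones.

```python
import numpy as np

# Hypothetical stand-ins for the three 1-D arrays.
athlete_id = np.array([10001, 20002, 30003])
height = np.array([185, 201, 176])
long_jump = np.array([6.25, 7.10, 5.80])

# Each 1-D array becomes one column of the 2-D result.
dataset = np.column_stack((athlete_id, height, long_jump))

print(dataset.shape)  # (3, 3)
print(dataset.dtype)  # float64
```

A NumPy array holds a single data type, so the integer IDs and heights are stored as floats once they share an array with the jump lengths.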

Now you have created a dataset in the form that you need. 

## 2. Scaling data

The next task is to perform a unit conversion so that the height (currently in cm) and the length of the jump (currently in m) are in the same unit.


### 2.1 
Let's convert the height of the athletes to meters.

Divide all values in the middle column by 100. 
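
A minimal sketch on a small hypothetical dataset with the same column order (ID, height, jump):

```python
import numpy as np

# Hypothetical stand-in: columns are athlete ID, height (cm), jump (m).
dataset = np.array([[10001.0, 185.0, 6.25],
                    [20002.0, 201.0, 7.10],
                    [30003.0, 176.0, 5.80]])

# Select the middle column with a slice and overwrite it in meters.
# The array must have a float dtype; in an integer array the assigned
# values would be truncated.
dataset[:, 1] = dataset[:, 1] / 100

print(dataset[:, 1])  # [1.85 2.01 1.76]
```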

#### Concept of vector division 

Assume you have a vector

a = 4,6,12

You want to divide the first element by 2, the second by 3, and the third by 6.

#### Example 1

In [85]:
a = np.array([4,6,12])

In [88]:
b = np.array([2,3,6])

In [89]:
np.divide(a,b)

array([2., 2., 2.])

#### Example 2

In [90]:
a = np.array([[4,6,12], [6, 9, 18]])

In [91]:
a

array([[ 4,  6, 12],
       [ 6,  9, 18]])

In [92]:
np.divide(a,b)

array([[2., 2., 2.],
       [3., 3., 3.]])

You can use the divide function to perform row-wise or column-wise operations. There are rules to follow to ensure that such operations do not throw errors. You can read about these rules [here](https://www.geeksforgeeks.org/python-broadcasting-with-numpy-arrays/#:~:text=Broadcasting%20Rules%3A&text=The%20two%20arrays%20are%20compatible,are%20compatible%20with%20all%20dimensions.)

This behavior is called broadcasting, and the rules describe when the dimensions of two arrays are compatible. You will learn more about this later in the program.
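
To see NumPy's broadcasting (compatibility) rules in action, here is a small sketch of a case that fails and one way to fix it. The array `row_divisors` is a hypothetical addition to the examples above.

```python
import numpy as np

a = np.array([[4, 6, 12], [6, 9, 18]])  # shape (2, 3)
row_divisors = np.array([2, 3])         # shape (2,), one divisor per row

try:
    # Shapes are compared from the last dimension: 3 vs 2 do not match
    # and neither is 1, so NumPy raises a ValueError.
    np.divide(a, row_divisors)
except ValueError as err:
    print("incompatible shapes:", err)

# Reshaping to (2, 1) makes the shapes compatible: the single column
# is stretched across the three columns of a.
result = np.divide(a, row_divisors.reshape(2, 1))
print(result)
# [[2. 3. 6.]
#  [2. 3. 6.]]
```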

### 2.2 
Calculate the ratio of the length of the jump to the height. The higher this ratio, the better the athlete is at the long jump.

You will have to slice the array to extract the required columns from the matrix, perform the operation, and ultimately add the new column to the original matrix. 
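
The slicing and appending steps might look like the sketch below, again on a small hypothetical dataset with the height already in meters.

```python
import numpy as np

# Hypothetical stand-in: columns are athlete ID, height (m), jump (m).
dataset = np.array([[10001.0, 1.85, 6.25],
                    [20002.0, 2.01, 7.10],
                    [30003.0, 1.76, 5.80]])

# Slice out the two columns and divide them element by element.
ratio = dataset[:, 2] / dataset[:, 1]

# Append the ratio as a fourth column.
dataset = np.column_stack((dataset, ratio))

print(dataset.shape)  # (3, 4)
```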

### 2.3 
Find the athlete ID with the highest ratio of jump length to height.
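
One way to do this is with `np.argmax`, which returns the position of the largest value. The data below is hypothetical.

```python
import numpy as np

# Hypothetical stand-in: columns are athlete ID, height (m), jump (m).
data = np.array([[10001.0, 1.85, 6.25],
                 [20002.0, 2.01, 7.10],
                 [30003.0, 1.76, 5.80]])
ratio = data[:, 2] / data[:, 1]

# Row index of the largest ratio, then the ID stored in that row.
best_row = np.argmax(ratio)
best_id = int(data[best_row, 0])

print(best_id)  # 20002
```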

## Conclusion 

1. You learned to use official documentation and tools like Stack Overflow to reduce your effort in writing code.
2. You used NumPy and scipy to create a dataset and perform operations on it.
3. You performed EDA and found the most efficient jumper.