# Project 1
This Jupyter notebook contains the first project for the programming and data analytics module.

## Index

1. [Objectives](#objectives)
2. [Used technologies](#used-technologies)
3. [Theme of the project](#theme-of-the-project)
4. [Variables](#variables)
5. [Creation of the database](#creation-of-the-database)
6. [Analysis](#analysis)
7. [Conclusions](#conclusions)
8. [References](#references)



## Objectives

Specifically, this project involves the following:

1. Choosing a real-world phenomenon that can be measured and for which at least one hundred data points can be collected across at least four different variables.
2. Investigating the types of variables involved, their likely distributions, and their relationships with each other.
3. Synthesizing/simulating a data set that matches those properties as closely as possible.

## Used technologies

For this project I have to create a dataset that simulates a real-world phenomenon, and model and synthesize that data using Python, the programming language used throughout this notebook.
For this purpose I use the numpy.random package, together with other packages such as matplotlib to visualize and model the data.

You can find information about these technologies at the following links:

https://numpy.org/doc/stable/reference/random/index.html# 

https://matplotlib.org/stable/tutorials/index.html

https://docs.python.org/3/

https://docs.python.org/3/library/statistics.html




## Theme of the project

For this project I have chosen walking as a physical activity, and its relationship with health, as the main topic.

Walking is a basic and inherent human activity; it was our main means of transport. Our hunter-gatherer ancestors walked out of necessity rather than by choice.
Nowadays we can get an idea of this lifestyle from present-day hunter-gatherer societies such as the Hadza[1], who took part in a study[2] of their daily walking activity, with daily estimates of 9 km for women and 15 km for men.

With the advent of industrialization and the arrival of cars, office jobs and elevators, all this changed. We humans became more sedentary as our lifestyles changed.

We used to spend a large part of the day on the move, and now we can order food from our mobile phone without moving at all.  
This lack of movement, along with the intake of industrial and processed foods and other bad habits, has had a negative impact on health.  
According to data[3] from the National Health and Nutrition Examination Survey (NHANES), nearly 1 in 3 adults (30.7%) in the United States are overweight and more than 2 in 5 (42.4%) are obese.  

This raises awareness of the need to include physical activity in our lives, as there are many known benefits of walking:
1. It is one of the easiest, most accessible and lowest-impact physical activities, and it requires no large investment of money beyond comfortable clothes and shoes.  
2. It is good for the heart, as it lowers blood pressure.[4] 
3. Despite being a low-intensity activity, it is good for weight loss. 
4. It is good for managing stress and improving creativity.[5] 

The best-known recommendation is to take 10,000 steps a day, which originated in the 1960s with a pedometer called the manpo-kei.
However, numerous studies indicate that multiple benefits are already achieved with fewer steps, as explained in a meta-analysis published in The Lancet Public Health[6].  

Based on this information and research, I will use the following variables to create the database:
1. Age
2. Sex
3. Number of daily steps
4. Occupation (manual or non-manual work)
5. Smoker (no or yes)
6. BMI 


For our study we will take a sample of size n = 100, i.e. we will have 100 observations of different subjects with different characteristics (age, sex...) that make up our field of study.  

The explanation of each variable can be found in the following section.  
We will see what type of variable we are dealing with, whether it is discrete or continuous, and which distribution applies to each variable, i.e. the way in which its values are spread in a sample or population.

To do this, we first need to import the libraries that allow us to generate random numbers and plot graphs. These libraries will also help us to generate our database.  



In [52]:
import numpy as np
import statistics
import matplotlib.pyplot as plt


## Variables

### Age 

Age is a continuous numeric variable. It is also an interval variable: these are variables that represent a numerical measurement on a continuous scale where the differences between values are meaningful and consistent. In the case of age, the numbers represent a quantity of years, and the difference between two ages always has the same meaning. In this project, age is recorded in whole years.  

For this project I have set an age range from 18 to 70 years old.  


Before generating any random variables, it is recommended to set a seed. Passing an integer to `np.random.seed()` makes NumPy generate the same sequence of random numbers on every run, which is useful for reproducibility and ensures that the samples do not change each time the code is run.
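
As a minimal sketch of what seeding does, the snippet below draws two small samples after resetting the same seed and checks that they are identical. The names `first_draw`, `second_draw` and `third_draw` are illustrative and are not part of the project code.

```python
import numpy as np

# Reset the legacy global generator to the same state before each draw
np.random.seed(1)
first_draw = np.random.randint(18, 70, 5)

np.random.seed(1)
second_draw = np.random.randint(18, 70, 5)

# Same seed and same call sequence: the two samples are identical
print(np.array_equal(first_draw, second_draw))  # True

# Without re-seeding, the next draw continues the sequence
# and will generally differ from the first one
third_draw = np.random.randint(18, 70, 5)
print(third_draw)
```

For new code, NumPy's documentation recommends creating a generator with `np.random.default_rng(seed)` rather than seeding the global state, although `np.random.seed` as used in this notebook still works.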


In [53]:
np.random.seed(1)

# Generates 100 random integer ages from 18 to 69 (the upper bound, 70, is exclusive)
age = np.random.randint(18, 70, 100)
age

array([55, 61, 30, 26, 27, 29, 23, 33, 18, 34, 19, 30, 25, 63, 24, 43, 68,
       38, 55, 36, 38, 29, 60, 46, 47, 32, 68, 22, 41, 41, 59, 67, 48, 50,
       40, 31, 59, 27, 25, 40, 19, 18, 35, 26, 42, 31, 69, 65, 60, 26, 48,
       25, 21, 24, 39, 67, 21, 22, 42, 67, 61, 30, 44, 34, 63, 69, 59, 36,
       33, 18, 22, 43, 65, 52, 41, 25, 44, 43, 58, 40, 27, 21, 57, 41, 54,
       45, 55, 37, 56, 26, 50, 52, 28, 41, 33, 65, 41, 43, 25, 69])

We have generated an array of n = 100 ages. We can get a first impression of the randomness by checking the minimum and maximum values and how often they are repeated.

In [54]:
min_value = min(age)
max_value = max(age)

print(f"The minimum value is ({min_value}).")
print(f"The maximum value is ({max_value}).")

The minimum value is (18).
The maximum value is (69).


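We can observe that, given n = 100, the minimum value (18) appears but 70 does not: the largest value is 69. This is because the upper bound passed to `np.random.randint` is exclusive, so the call above can only return values from 18 to 69. Using the following code we can count how many times the minimum value appears.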

In [55]:
# Count how many times the minimum value (18) appears in the sample
count = np.count_nonzero(age == 18)
print(count)

3


The minimum value is repeated 3 times. 
A larger sample would still never contain 70, because `np.random.randint` excludes its upper bound; to include 70, the call would have to use 71 as the upper bound. Within the range it does cover, the function draws from a uniform distribution, meaning that each integer has the same probability of being selected.
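
The range actually covered by `np.random.randint` can be checked directly. The sketch below assumes a much larger sample of 100,000 draws (a size chosen only for illustration) and compares an upper bound of 70 with an upper bound of 71. The names `large_sample` and `inclusive_sample` are hypothetical.

```python
import numpy as np

np.random.seed(1)

# The upper bound of randint is exclusive: values fall in 18..69
large_sample = np.random.randint(18, 70, 100_000)
print(large_sample.min(), large_sample.max())

# To make 70 reachable, the upper bound has to be 71
inclusive_sample = np.random.randint(18, 71, 100_000)
print(inclusive_sample.min(), inclusive_sample.max())

# Under a uniform distribution each age should appear roughly equally often
values, counts = np.unique(inclusive_sample, return_counts=True)
print(counts.min(), counts.max())
```

An alternative, if the newer Generator API is preferred, is `np.random.default_rng(1).integers(18, 70, 100, endpoint=True)`, which treats the upper bound as inclusive.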

To describe how the age variable is distributed in our sample of size 100, we start by computing summary statistics such as the mean and the standard deviation.

1. The mean is the average value of a set of numbers, obtained by adding up all the values and dividing by the total number of elements.

2. The standard deviation measures how dispersed the values are with respect to the mean; it is the square root of the variance and provides a measure of the variability in a data set.
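
To make the two definitions concrete, the following sketch computes both by hand for the first five ages printed above and compares them with the standard library's `statistics` module. The variable names are illustrative. `statistics.pstdev` is the population standard deviation (it divides by n), whereas `statistics.stdev` would divide by n - 1.

```python
import math
import statistics

# First five ages from the sample printed above
ages = [55, 61, 30, 26, 27]

# Mean: sum of the values divided by how many there are
mean_by_hand = sum(ages) / len(ages)

# Variance: average squared distance from the mean (population form, divides by n)
variance_by_hand = sum((a - mean_by_hand) ** 2 for a in ages) / len(ages)

# Standard deviation: square root of the variance
std_by_hand = math.sqrt(variance_by_hand)

print(mean_by_hand, statistics.mean(ages))
print(std_by_hand, statistics.pstdev(ages))
```

NumPy's `np.std` also uses the population form by default (`ddof=0`), so it would agree with `pstdev` here.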

In [56]:
# Average age of the sample
np.mean(age)


41.2

In [57]:
# Median age of the sample
np.median(age)

40.5
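
Because the ages are drawn uniformly from the integers 18 to 69, the sample statistics can be compared with the theoretical values of a discrete uniform distribution. The sketch below uses the standard formulas for that distribution: the mean is (a + b) / 2 and the standard deviation is sqrt(((b - a + 1)^2 - 1) / 12). The names `low`, `high` and `sample` are illustrative, and the sample is re-created with the same seed and call as in the notebook.

```python
import numpy as np

low, high = 18, 69  # smallest and largest ages that randint(18, 70) can return

# Theoretical mean and standard deviation of a discrete uniform distribution
theoretical_mean = (low + high) / 2
theoretical_std = np.sqrt(((high - low + 1) ** 2 - 1) / 12)

# Re-create a sample in the same way as the notebook does
np.random.seed(1)
sample = np.random.randint(low, high + 1, 100)

print(f"theoretical mean: {theoretical_mean}, sample mean: {np.mean(sample)}")
print(f"theoretical std: {theoretical_std:.2f}, sample std: {np.std(sample):.2f}")
```

With only 100 observations, a sample mean a little below or above the theoretical mean of 43.5 is ordinary sampling variation, since the standard error of the mean is roughly 15 / sqrt(100) = 1.5 years.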

## Creation of the database

## Analysis

## Conclusions

## References

[1] https://education.nationalgeographic.org/resource/hadza/  
[2] https://www.nature.com/articles/s41562-020-01002-7  
[3] https://www.niddk.nih.gov/health-information/health-statistics/overweight-obesity  
[4] https://ajcn.nutrition.org/article/S0002-9165(23)23351-1/fulltext  
[5] https://news.stanford.edu/2014/04/24/walking-vs-sitting-042414/  
[6] https://www.thelancet.com/journals/lanpub/article/PIIS2468-2667(21)00302-9/fulltext#seccestitle140  
