# Practice in Python - Statistics Module I

In this session, I practice statistics in Python, covering:
- Simple, Systematic, and Stratified Sampling;
- Measures of Central Tendency and Variability;
- Normal Distribution;
- Normality Tests.

In [47]:
import pandas as pd 
import numpy as np
from math import ceil
from sklearn.model_selection import train_test_split

In [3]:
dtiris = pd.read_csv('iris.csv')
dtiris

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [45]:
dtinfert = pd.read_csv('infert.csv')
dtinfert

Unnamed: 0.1,Unnamed: 0,education,age,parity,induced,case,spontaneous,stratum,pooled.stratum
0,1,0-5yrs,26,6,1,1,2,1,3
1,2,0-5yrs,42,1,1,1,0,2,1
2,3,0-5yrs,39,6,2,1,0,3,4
3,4,0-5yrs,34,4,2,1,0,4,2
4,5,6-11yrs,35,3,1,1,1,5,32
...,...,...,...,...,...,...,...,...,...
243,244,12+ yrs,31,1,0,0,1,79,45
244,245,12+ yrs,34,1,0,0,0,80,47
245,246,12+ yrs,35,2,2,0,0,81,54
246,247,12+ yrs,29,1,0,0,1,82,43


----
# Simple Sampling
Simple random sampling is the most basic statistical sampling technique: elements are selected at random from a population, and each element has the same chance of being chosen.

In [4]:
#Creating a random sample using the iris.csv dataset.

#We set a random seed so that the results are reproducible, even though the sample is random.
np.random.seed(2345)

simple_sample = np.random.choice(a=[0,1], size=150, replace=True, p=[0.7,0.3])
#How the random sample is created:
    # (a=[0,1]) -> The result is an array of zeros (0) and ones (1), one flag per row
    # (size=150) -> The array has 150 entries (one for each row of the iris.csv dataset)
    # (replace=True) -> Values are drawn with replacement: after a 0 or 1 is drawn, it remains available for the next draw (until size=150)
    # (p=[0.7,0.3]) -> Probability of drawing each element of a=[0,1] {0 has a probability of 0.7 -- 1 has a probability of 0.3}

#Displays the sample
simple_sample

array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1])

In [7]:
#Checking the size and data of the sample
simple_sample_size = len(simple_sample)
simple_sample_zero = len(simple_sample[simple_sample == 0])
simple_sample_one = len(simple_sample[simple_sample == 1])

print(f'The sample size is {simple_sample_size}')
print(f'Number of elements zero (0): {simple_sample_zero}')
print(f'Number of elements one (1): {simple_sample_one}')

The sample size is 150
Number of elements zero (0): 101
Number of elements one (1): 49


In [9]:
#Using the generated vector as a mask to randomly select rows from the iris.csv dataset (population)
simple_sample_final = dtiris.loc[simple_sample == 0]
simple_sample_final

#We could also use [simple_sample == 1], which would give the other sample of 49 records.

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
...,...,...,...,...,...
140,6.7,3.1,5.6,2.4,Iris-virginica
143,6.8,3.2,5.9,2.3,Iris-virginica
144,6.7,3.3,5.7,2.5,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica


In this scenario, the iris.csv dataset is our population, and the goal was to draw a simple random sample from it.
1. We set a random seed so that the same sample is reproduced on every run.
2. We created a random vector of the same length as the dataset (size=150), composed of zeros and ones drawn with probabilities 0.7 and 0.3 respectively.
3. We used .loc with the mask [simple_sample == 0] to select the matching records from the dataset (population).
4. The result is a random sample of 101 records in which every record had the same 0.7 chance of being selected. Note that the sample size itself is random under this approach; only its expected value (150 x 0.7 = 105) is fixed.
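
The 0/1 mask above keeps each row independently, so the number of selected rows varies with the seed. When a fixed sample size is wanted, a sketch along the following lines could be used instead. It assumes a small synthetic DataFrame (`toy_population`, a hypothetical stand-in for iris.csv) and a sample size of 100; neither comes from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the population: 150 rows with a 0..149 index.
toy_population = pd.DataFrame({"value": np.arange(150) * 0.1})

# Draw exactly 100 distinct rows; random_state makes the draw reproducible.
fixed_sample = toy_population.sample(n=100, replace=False, random_state=2345)

# The same idea with NumPy's Generator API: pick 100 distinct row positions.
rng = np.random.default_rng(2345)
positions = rng.choice(len(toy_population), size=100, replace=False)
fixed_sample_np = toy_population.iloc[np.sort(positions)]

print(len(fixed_sample), len(fixed_sample_np))
```

Both variants sample without replacement, so no row can appear twice and the sample size is always the requested one.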

----
# Systematic Sampling
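Systematic sampling is a technique in which elements are selected according to a predefined pattern, such as every k-th element of the population after a random starting point. It is simple and efficient to carry out and spreads the sample evenly across the population, although it can be biased when the order of the data follows a periodic pattern.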

In [12]:
#Creating the variables for systematic sampling and defining the step k between selected population elements.

population = 150                    #Same number of records as in the iris.csv dataset (real population).
sys_sample = 15                     #The sample will consist of 15 systematically selected records.
k = ceil(population/sys_sample)     #Out of every 10 records in the population, 1 will be added to the sample.
print(k)


10


In [32]:
#Within each block of 10 records, which one is added to the sample? We draw the starting point at random.

r = np.random.randint(low=0, high=k, size=1)
print(r)
#We use (.randint) to generate random integers:
    #(low=0) -> sets the lower limit of the interval to 0, because the DataFrame index starts at 0;
    #(high=k) -> sets the upper limit of the interval, in this case 10 (the upper limit is excluded, so r ranges from 0 to 9 and the largest label drawn later, r + 14*k, never exceeds 149);
    #(size=1) -> specifies the output size, so only 1 random integer is generated;

[7]


In [40]:
#We loop to build the sample labels: start at r and add k at each step.
box = int(r[0])   #Convert the NumPy integer to a plain Python int
drawn = []

for i in range(sys_sample):
    drawn.append(box)
    box += k


In [43]:
#Result:
print(f'The size of the drawn sample is [{len(drawn)}]')
print(f'The elements composing the sample are: {drawn}')


The size of the drawn sample is [15]
The elements composing the sample are: [7, 17, 27, 37, 47, 57, 67, 77, 87, 97, 107, 117, 127, 137, 147]


In [44]:
#Using the drawn labels to select the corresponding rows from the real population.
sys_sample_final = dtiris.loc[drawn]
sys_sample_final

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
7,5.0,3.4,1.5,0.2,Iris-setosa
17,5.1,3.5,1.4,0.3,Iris-setosa
27,5.2,3.5,1.5,0.2,Iris-setosa
37,4.9,3.1,1.5,0.1,Iris-setosa
47,4.6,3.2,1.4,0.2,Iris-setosa
57,4.9,2.4,3.3,1.0,Iris-versicolor
67,5.8,2.7,4.1,1.0,Iris-versicolor
77,6.7,3.0,5.0,1.7,Iris-versicolor
87,6.3,2.3,4.4,1.3,Iris-versicolor
97,6.2,2.9,4.3,1.3,Iris-versicolor


In this scenario, we carried out the sampling process systematically:

1. We defined the variables that set the selection pattern. We wanted a sample of 15 records from a population of 150 records, so we computed k, the sampling interval (here, 1 record is added to the sample for every 10 population records).
2. We randomly determined which record within the first block of 10 starts the sample (in this case, r=7). Note: no seed is set here, so the value of [r] can change every time the code is executed.
3. We ran a loop to store the 15 sample labels in a list.
4. We used these labels to select the corresponding rows from the population dataset.
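
The steps above can be collected into one small helper. This is only a sketch: the function name `systematic_indices` and its `seed` parameter are hypothetical, and it assumes a 0-based index, as produced by `pd.read_csv` by default.

```python
from math import ceil

import numpy as np


def systematic_indices(population_size, sample_size, seed=None):
    """Return 0-based row labels for a systematic sample (illustrative helper)."""
    k = ceil(population_size / sample_size)      # sampling interval
    rng = np.random.default_rng(seed)
    start = int(rng.integers(low=0, high=k))     # random start in 0..k-1
    # Step through the population every k rows and never go past the last row.
    # If population_size is not a multiple of sample_size, fewer than
    # sample_size labels may be returned.
    return list(range(start, population_size, k))[:sample_size]


drawn = systematic_indices(150, 15, seed=0)
print(len(drawn), drawn[:3])
```

Because the start is drawn from 0 to k-1, every returned label exists in a 150-row DataFrame, so the list can be passed directly to `.loc` or `.iloc`.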

----
# Stratified Sampling
Stratified sampling is a technique in which the population is divided into internally homogeneous groups (strata) and a random sample is drawn from each stratum. With proportional allocation, each stratum is represented in the sample in the same proportion as in the population.

### 01. Example of Balanced Stratification:

In [48]:
#Analyzing and counting how many records exist in the column from which we will generate the stratified sample
dtiris['class'].value_counts()


class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64

In [57]:
#We use the train_test_split function to generate our stratified sample
x_iris, _, y_iris, _ = train_test_split(dtiris.iloc[:,0:4], dtiris.iloc[:,4], test_size=0.5, stratify=dtiris.iloc[:,4])
y_iris.value_counts()

# train_test_split splits the data into training and testing sets:
    # (dtiris.iloc[:,0:4]) -> The input data are the first four columns (features).
    # (dtiris.iloc[:,4]) -> The labels are the values from the fifth column (class).
    # (test_size=0.5) -> Defines the proportion of the test set, in this case 50%
    # (stratify=dtiris.iloc[:,4]) -> Stratifies by the labels, keeping the same class proportions in the training and testing sets
    # The training portions are assigned to x_iris (features) and y_iris (labels); they form our sample
    # (_) -> Placeholder for the two testing sets, which we do not use
    # No random_state is passed, so the selected rows change on every run


class
Iris-setosa        25
Iris-virginica     25
Iris-versicolor    25
Name: count, dtype: int64

In [59]:
#Using the sample's index to select the corresponding rows from the population
str_sample_balanced = dtiris.loc[y_iris.index]
str_sample_balanced

Unnamed: 0,sepal length,sepal width,petal length,petal width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
101,5.8,2.7,5.1,1.9,Iris-virginica
14,5.8,4.0,1.2,0.2,Iris-setosa
92,5.8,2.6,4.0,1.2,Iris-versicolor
49,5.0,3.3,1.4,0.2,Iris-setosa
...,...,...,...,...,...
67,5.8,2.7,4.1,1.0,Iris-versicolor
95,5.7,3.0,4.2,1.2,Iris-versicolor
18,5.7,3.8,1.7,0.3,Iris-setosa
15,5.7,4.4,1.5,0.4,Iris-setosa


About the process:

- **x_iris** stores the feature data selected by *dtiris.iloc[:,0:4]*. Here, *iloc[:,0:4]* selects all rows of the first four columns (positions 0 to 3) in dtiris, which are the features that would be used to train a model.

- **y_iris** stores the label data, taken from the single column selected by *dtiris.iloc[:,4]*. Here, *iloc[:,4]* selects all rows of the fifth column (position 4) in dtiris, which contains the labels.

- **test_size** is the parameter that sets the proportion of the data assigned to the test set. In this case, test_size=0.5 sends half of the data to the test set, and the other half (the training set) is what we keep as our sample.

- **stratify=dtiris.iloc[:,4]** is the parameter that specifies which column (or series) is used to stratify the data. Because we pass the label column, each class keeps the same proportion in the training and testing sets as in the full dataset, which is why the sample contains 25 records of each class.
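
As an alternative to train_test_split, a proportional stratified sample can also be drawn directly with pandas by sampling the same fraction inside each group. The sketch below is not the approach used in this notebook; it assumes a synthetic DataFrame (`toy`, hypothetical) with three classes of 50 rows each, mirroring only the class counts shown above.

```python
import pandas as pd

# Hypothetical stand-in for the balanced dataset: three classes, 50 rows each.
toy = pd.DataFrame({
    "measurement": range(150),
    "class": ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50,
})

# Sample 50% of the rows inside each class; random_state fixes the draw.
stratified = toy.groupby("class").sample(frac=0.5, random_state=42)

counts = stratified["class"].value_counts()
print(counts)
```

The original index is preserved in `stratified`, so the selected rows can still be traced back to the population.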

### 02. Example of Unbalanced Stratification:

In [49]:
#Analyzing and counting how many records exist in the column from which we will generate the stratified sample
dtinfert['education'].value_counts()

education
6-11yrs    120
12+ yrs    116
0-5yrs      12
Name: count, dtype: int64

In [64]:
#We use the train_test_split function to generate our stratified sample
x_infert, _, y_infert, _ = train_test_split(dtinfert.iloc[:,2:9], dtinfert.iloc[:,1], test_size=0.6, stratify=dtinfert.iloc[:,1])
y_infert.value_counts()

education
6-11yrs    48
12+ yrs    46
0-5yrs      5
Name: count, dtype: int64

In [62]:
#Using the sample's index to select the corresponding rows from the population
str_sample_unbalanced = dtinfert.loc[y_infert.index]
str_sample_unbalanced

Unnamed: 0.1,Unnamed: 0,education,age,parity,induced,case,spontaneous,stratum,pooled.stratum
223,224,12+ yrs,28,1,0,0,0,59,42
147,148,12+ yrs,26,2,0,0,1,65,49
164,165,12+ yrs,23,1,0,0,1,83,40
76,77,12+ yrs,31,2,0,1,1,77,53
67,68,12+ yrs,28,3,1,1,2,68,58
...,...,...,...,...,...,...,...,...,...
74,75,12+ yrs,26,2,0,1,2,75,49
110,111,6-11yrs,34,3,0,0,0,28,31
0,1,0-5yrs,26,6,1,1,2,1,3
123,124,6-11yrs,39,3,1,0,0,41,33


About the process:
- The output of y_infert.value_counts() covers 99 records, about 40% of the original 248, not the 60% passed as test_size.
- This is expected: test_size=0.6 sets the share of the testing set, and y_infert is the training part of the split, so it holds the remaining 40%. The other 60% (149 records) went to the testing set, which we discarded in the (_) variables.
- Stratification does not change the size of the split. It only makes sure that each education level keeps approximately its population share: 48 of 120, 46 of 116, and 5 of 12 records, each close to 40%.
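
To make the sizes of the four pieces returned by train_test_split easier to see, the following sketch keeps all of them instead of discarding two. It uses synthetic labels that reproduce only the class counts of the education column (120, 116, and 12); the placeholder feature column and the random_state value are assumptions added for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with the same class counts as the education column.
labels = np.array(["6-11yrs"] * 120 + ["12+ yrs"] * 116 + ["0-5yrs"] * 12)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

# Keep all four outputs: training and testing parts of features and labels.
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.6, stratify=labels, random_state=0
)


def class_counts(values):
    """Count how many times each label appears."""
    names, counts = np.unique(values, return_counts=True)
    return {str(name): int(count) for name, count in zip(names, counts)}


train_counts = class_counts(y_train)
test_counts = class_counts(y_test)
print(len(y_train), train_counts)
print(len(y_test), test_counts)
```

The training part plays the role of y_infert in the cell above, and the testing part is what the (_) placeholders throw away.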