# PlantGrowth R dataset

## Index
1) Introduction  
   1.1 Project context  
   1.2 Analysis objectives  
   1.3 Description of the PlantGrowth dataset  
2) Libraries used in the analysis of the project  
3) What are the t statistic and ANOVA, and what are they used for?  
   3.1 t statistic  
   3.2 ANOVA  
   3.3 Similarities and differences  
4) Dataset Loading and Exploration  
5) Data analysis  
6) Data visualization  
7) Conclusions  
8) References

## 1. Introduction

### 1.1 Project context

This fictional database about plants is a pre-loaded example dataset in R. It is part of a collection of 2337 datasets that were originally distributed alongside the statistical software environment R and some of its add-on packages.  
The goal of the collection is to make these data more broadly accessible for teaching and statistical software development.  

### 1.2 Analysis objectives  

The dataset will be used to perform statistical analyses that determine whether the treatments have a significant effect on plant growth compared to a control group.  
The steps to be carried out are the following:  
1. Download the dataset and load it in the IDE.  
2. Describe the dataset.  
3. Describe what a t-test is, how it works, and what its assumptions are.  
4. Perform a t-test to determine whether there is a significant difference between the two treatment groups trt1 and trt2.  
5. Perform ANOVA to determine whether there is a significant difference between the three groups ctrl, trt1, and trt2.  
6. Explain why it is more appropriate to apply ANOVA rather than several t-tests when analyzing more than two groups.  
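
The usual argument behind step 6 is that every additional t-test adds another chance of a false positive. The short sketch below illustrates this under a simplifying assumption: the pairwise tests are treated as independent, which is only approximately true when the comparisons share groups. The significance level of 0.05 is also an assumption chosen for illustration.

```python
from math import comb

alpha = 0.05   # assumed significance level of each individual t-test
n_groups = 3   # ctrl, trt1 and trt2

# Number of pairwise t-tests needed to compare every group with every other
n_pairs = comb(n_groups, 2)

# Probability of at least one false positive across all tests,
# under the simplifying assumption that the tests are independent
familywise_error = 1 - (1 - alpha) ** n_pairs

print(f"Pairwise tests needed: {n_pairs}")
print(f"Approximate familywise error rate: {familywise_error:.4f}")
```

A single ANOVA tests all group means at once, so the overall false-positive rate stays at the chosen level.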

### 1.3 Description of the PlantGrowth dataset  

This fictitious database is part of the base R package (datasets) and contains the results of an experiment that measures the weight of plants after applying different treatments.  


A brief description of this database can be found on the official website [1], summarized in the following table:  

| Package  | Item        | Title                                     | CSV  | Doc  | Rows | Cols | n_binary | n_character | n_factor | n_logical | n_numeric |
|----------|-------------|-------------------------------------------|------|------|------|------|----------|-------------|----------|-----------|-----------|
| datasets | PlantGrowth | Results from an Experiment on Plant Growth | CSV  | Doc  | 30   | 2    | 0        | 0           | 1        | 0         | 1         |


The dataset is very small and contains the following:  

- Number of observations (rows): 30, which is the sample size  
- Number of variables: 2  
    - weight: A numerical variable that measures the weight of plants (in arbitrary units) after a period of growth.  
    - group: A categorical variable that indicates the treatment group to which each plant belongs. It has three levels:  
        - ctrl: Control group, without treatment applied.  
        - trt1: First treatment group.  
        - trt2: Second treatment group.  

## 2. Libraries used in the analysis of the project  




To carry out this project, the following Python libraries are useful:

**Pandas**: It is one of the main libraries for the manipulation and analysis of existing data. It is used to load, process and clean the data set.  

**NumPy**: Used for numerical calculations and matrix operations, commonly used with Pandas.

**Matplotlib**: Library used for data visualization.  

**SciPy**: Contains statistical tools that allow you to perform tests such as the t-test and ANOVA to analyze the differences between groups.


## 3. What are the t statistic and ANOVA, and what are they used for? 

### 3.1 t statistic

**The t statistic** is a value used to test whether the difference between the mean responses of two groups is statistically significant.  

It is based on Student's t distribution, which resembles the normal distribution: both are continuous, symmetric and bell-shaped, and in standardized form both are centered at 0. However, the tails of the t distribution are heavier, reflecting the additional uncertainty introduced by estimating the unknown population variance from the sample.  
The t distribution is used here instead of the normal distribution because the sample is small and the population variance is unknown.  
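
As a minimal sketch of how the statistic is computed, the following uses the pooled-variance (Student) form of the two-sample t statistic, which assumes equal variances in both groups. The sample values and the function name `pooled_t_statistic` are made up for illustration and are not taken from PlantGrowth.

```python
import math
import statistics


def pooled_t_statistic(sample_a, sample_b):
    """Two-sample t statistic with pooled variance (assumes equal variances)."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance is the sample variance (divides by n - 1)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)

    degrees_of_freedom = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / degrees_of_freedom
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, degrees_of_freedom


# Hypothetical measurements, for illustration only
group_a = [4.8, 5.1, 5.5, 4.9, 5.3]
group_b = [5.6, 6.0, 5.8, 6.3, 5.9]

t_value, dof = pooled_t_statistic(group_a, group_b)
print(f"t = {t_value:.3f} with {dof} degrees of freedom")
```

The further the t value is from 0, the less compatible the data are with equal group means. The p-value comes from comparing it with the t distribution for the given degrees of freedom.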

### 3.2 ANOVA  

**ANOVA (Analysis of Variance)** is a statistical technique used to compare the means of three or more groups and determine whether there are significant differences between them. It evaluates the variability within and between groups to check whether the observed differences reflect a real effect of the factor or could be due to chance.  
It is based on the ratio of the explained variance (between groups) to the unexplained variance (within groups), expressed by a statistic called 𝐹.  
ANOVA is commonly used in experiments and studies with multiple categories.
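
The ratio described above can be sketched by hand for a one-way design. The function name `one_way_anova_f` and the sample values are hypothetical and serve only to illustrate the calculation.

```python
import statistics


def one_way_anova_f(*groups):
    """F statistic for a one-way ANOVA, computed from sums of squares."""
    all_values = [value for group in groups for value in group]
    grand_mean = statistics.mean(all_values)

    # Explained variability: how far each group mean lies from the grand mean
    ss_between = sum(
        len(group) * (statistics.mean(group) - grand_mean) ** 2 for group in groups
    )
    # Unexplained variability: spread of the values around their own group mean
    ss_within = sum(
        (value - statistics.mean(group)) ** 2 for group in groups for value in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical measurements, for illustration only
group_a = [4.8, 5.1, 5.5, 4.9, 5.3]
group_b = [5.6, 6.0, 5.8, 6.3, 5.9]
group_c = [5.0, 5.4, 5.2, 5.6, 5.1]

f_value, df_between, df_within = one_way_anova_f(group_a, group_b, group_c)
print(f"F = {f_value:.3f} with ({df_between}, {df_within}) degrees of freedom")
```

A large F means the group means differ more than the within-group scatter alone would suggest.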

### 3.3 Similarities and differences 
The t-test and ANOVA are closely related because both evaluate differences between group means, but they differ in how they are applied and in the scenarios for which they are designed.  


The following table summarizes the purpose of each test and the differences between them:

| **Aspect**         | **t-test**                                      | **ANOVA**                                     |
|---------------------|------------------------------------------------|-----------------------------------------------|
| **Number of groups** | Compares the means of **two groups**           | Compares the means of **three or more groups** |
| **Null hypothesis** | The means of the two groups are equal          | All group means are equal                     |
| **Statistic**       | Computes the **t** value                       | Computes the **F** value                      |
| **Purpose**         | Determine if there is a significant difference between two specific groups | Determine if at least one group is different from the others |
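
One way to see how closely the two tests are related: with exactly two groups, a one-way ANOVA and an equal-variance t-test give the same p-value, and the F value equals the square of the t value. The sketch below checks this with `scipy.stats.ttest_ind` and `scipy.stats.f_oneway`. The sample values are hypothetical, not PlantGrowth data.

```python
import scipy.stats as stats

# Hypothetical measurements, for illustration only
group_a = [4.8, 5.1, 5.5, 4.9, 5.3]
group_b = [5.6, 6.0, 5.8, 6.3, 5.9]

# Student's t-test assuming equal variances in both groups
t_result = stats.ttest_ind(group_a, group_b, equal_var=True)

# One-way ANOVA applied to the same two groups
f_result = stats.f_oneway(group_a, group_b)

print(f"t squared = {t_result.statistic ** 2:.6f}")
print(f"F         = {f_result.statistic:.6f}")
print(f"p (t-test) = {t_result.pvalue:.6f}")
print(f"p (ANOVA)  = {f_result.pvalue:.6f}")
```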


## 4. Dataset Loading and Exploration  

To load the data I use the Pandas library, so the first step is to import all the libraries needed for both the data analysis and the visualization.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import scipy.stats as stats  

To load the dataset, I use the following Pandas code:

In [2]:
# Use pandas to read the CSV file into a DataFrame
data_path = "project-db/PlantGrowth.csv"
plant_growth_data = pd.read_csv(data_path)

# Check the first rows to see what the dataset looks like
print(plant_growth_data.head())

   rownames  weight group
0         1    4.17  ctrl
1         2    5.58  ctrl
2         3    5.18  ctrl
3         4    6.11  ctrl
4         5    4.50  ctrl


Although section 1.3 already introduced and described the variables, the following code provides similar information. Note that the CSV file includes an extra rownames index column, so the shape reports three columns rather than the two variables described earlier:

In [3]:
# Dataset size
print(f"Rows and columns: {plant_growth_data.shape}")

Rows and columns: (30, 3)
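
The third column in the shape is the rownames column visible in the `head()` output, not a measured variable. One possible way to handle it is to read it as the index and then inspect the two real variables. The sketch below rebuilds only the five rows printed earlier from an in-memory string so that it runs without the CSV file. The header names are assumed to match the displayed column names, and the variable names are illustrative.

```python
import io

import pandas as pd

# First five rows as printed by head(); header names are an assumption
csv_text = """rownames,weight,group
1,4.17,ctrl
2,5.58,ctrl
3,5.18,ctrl
4,6.11,ctrl
5,4.50,ctrl
"""

# index_col makes rownames the index instead of a data column
sample = pd.read_csv(io.StringIO(csv_text), index_col="rownames")

# Store the treatment label as a categorical variable
sample["group"] = sample["group"].astype("category")

print(f"Rows and columns: {sample.shape}")
print(sample.dtypes)
print(sample["group"].value_counts())
print(sample["weight"].describe())
```

With the full file, the same `index_col="rownames"` argument can be passed to the existing `pd.read_csv(data_path)` call.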


***
## References  
 
[1] https://vincentarelbundock.github.io/Rdatasets/articles/data.html  
[2] https://vincentarelbundock.github.io/Rdatasets/doc/datasets/PlantGrowth.html  
[3] https://github.com/vincentarelbundock/Rdatasets/tree/master  
[4] https://www.investopedia.com/terms/t/t-test.asp  