# Applied Statistics Winter 2024 Project

**by Nur Bujang**

project.ipynb
***

# Title: Analysis of PlantGrowth Data Set

# Project Description

Complete the project in a single notebook called `project.ipynb` in your repository.
The same style should be used as detailed above: explanations in MarkDown and code comments, clean code, and regular commits.
Use plots as appropriate.

In this project, you will analyze the [PlantGrowth R dataset](https://vincentarelbundock.github.io/Rdatasets/csv/datasets/PlantGrowth.csv).
You will find [a short description](https://vincentarelbundock.github.io/Rdatasets/doc/datasets/PlantGrowth.html) of it on [Vincent Arel-Bundock's Rdatasets page](https://vincentarelbundock.github.io/Rdatasets/).
The dataset contains two main variables, a treatment group and the weight of plants within those groups.

1. Download and save the dataset to your repository.

2. Describe the data set in your notebook.

3. Describe what a t-test is, how it works, and what the assumptions are.

4. Perform a t-test to determine whether there is a significant difference between the two treatment groups trt1 and trt2.

5. Perform ANOVA to determine whether there is a significant difference between the three treatment groups ctrl, trt1, and trt2.

6. Explain why it is more appropriate to apply ANOVA rather than several t-tests when analyzing more than two groups.


## Abstract



## 1.0 Plan

1. Download and save dataset
    - `import pandas as pd`, `import numpy as np`
    - `url`, `df = pd.read_csv`, `df.to_csv`
    - `df.head`, `df.tail`

2. Describe the data set
    - df.info, df.describe

3. Describe t-test

4. t-test for trt1 and trt2
    - Set null and alternative hypothesis
    - import numpy as np, from scipy import stats
    - alpha
    - ttest_ind

5. ANOVA for ctrl, trt1, trt2
    - Set null and alternative hypothesis
    - alpha, ctrl, trt1, trt2
    - fstat, fpval = stats.f_oneway(ctrl, trt1, trt2)

6. Why is ANOVA more appropriate than several t-tests for more than two groups?

## 2.0 Methods and Implementation

1. Download and save dataset

The PlantGrowth.csv dataset <a href="https://vincentarelbundock.github.io/Rdatasets/csv/datasets/PlantGrowth.csv">(Arel-Bundock, n.d.)</a> was retrieved from <a href="https://vincentarelbundock.github.io/Rdatasets/doc/datasets/PlantGrowth.html">(Arel-Bundock, n.d.)</a>.

The CSV file was read into a DataFrame with `pd.read_csv` <a href="https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html">(Pandas, n.d.)</a> and saved to the repository with `df.to_csv` <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html">(Pandas, n.d.)</a>. See also <a href="https://github.com/ianmcloughlin/2425_applied_statistics">McLoughlin (2024)</a>.

In [5]:
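import pandas as pd
import numpy as np

# Source URL of the PlantGrowth dataset on the Rdatasets site
url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/PlantGrowth.csv"
# Read the CSV into a DataFrame
df = pd.read_csv(url)
# Save a local copy to the repository, without the DataFrame index
df.to_csv("PlantGrowth.csv", index=False)
# Show the first five rows
df.head()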


| | rownames | weight | group |
|---|---|---|---|
| 0 | 1 | 4.17 | ctrl |
| 1 | 2 | 5.58 | ctrl |
| 2 | 3 | 5.18 | ctrl |
| 3 | 4 | 6.11 | ctrl |
| 4 | 5 | 4.5 | ctrl |


In [6]:
df.tail()

| | rownames | weight | group |
|---|---|---|---|
| 25 | 26 | 5.29 | trt2 |
| 26 | 27 | 4.92 | trt2 |
| 27 | 28 | 6.15 | trt2 |
| 28 | 29 | 5.8 | trt2 |
| 29 | 30 | 5.26 | trt2 |
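Once the file has been saved, later runs do not need to download it again. The following is a minimal sketch of that idea, assuming a hypothetical helper named `load_plantgrowth` (it is not part of pandas) and the same file name and URL as above:

```python
from pathlib import Path

import pandas as pd

URL = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/PlantGrowth.csv"


def load_plantgrowth(path="PlantGrowth.csv", url=URL):
    """Return the dataset, preferring the local copy (hypothetical helper)."""
    local = Path(path)
    if local.exists():
        # Reuse the copy already saved in the repository
        return pd.read_csv(local)
    # First run: download, then save without the DataFrame index
    df = pd.read_csv(url)
    df.to_csv(local, index=False)
    return df
```

Calling `df = load_plantgrowth()` would then behave like the cell above on the first run and read `PlantGrowth.csv` from disk afterwards.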


2. Describe the data set

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   rownames  30 non-null     int64  
 1   weight    30 non-null     float64
 2   group     30 non-null     object 
dtypes: float64(1), int64(1), object(1)
memory usage: 852.0+ bytes


This dataset records the dried weight of plants from a plant growth experiment run under three conditions: control (`ctrl`), treatment 1 (`trt1`) and treatment 2 (`trt2`). It contains 30 rows and three columns: `rownames`, `weight` and `group`.

Column `rownames` holds 64-bit integers (whole numbers, without a decimal point) and has no missing values. It numbers the rows from 1 to 30.

Column `weight` holds 64-bit floating-point numbers, which can carry decimal values. If a column contains both integers and floats, pandas stores the whole column as floats to retain the decimal values <a href="https://datacarpentry.org/python-ecology-lesson/04-data-types-and-format.html">(Gosset and Wright, 2017)</a>. Weight is continuous numerical data, a type of quantitative data that can take any value between two points <a href="https://www.g2.com/articles/discrete-vs-continuous-data#what-is-continuous-data">(Zangre, 2024)</a>.
This column has no missing values. pandas also defaults to float when a numeric column contains missing values <a href="https://datacarpentry.org/python-ecology-lesson/04-data-types-and-format.html">(Gosset and Wright, 2017)</a>, but that does not apply here. In this dataset, `weight` contains the dried weight of plants in grams.

Column `group` is of type object and has no missing values. The object data type is often used for text or nominal categorical data, but it can hold any Python object, including strings, lists, integers or custom objects <a href="https://numpy.org/devdocs/reference/arrays.dtypes.html">(NumPy Developers, n.d.)</a>. Nominal categorical data is qualitative data whose categories have no specific ranking or order <a href="https://www.statisticssolutions.com/levels-of-measurement/">(Statistics Solutions, 2017)</a>. In this dataset, `group` contains labels for the three conditions: `ctrl` (control), `trt1` (treatment 1) and `trt2` (treatment 2).
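The points about missing values and data types can be checked directly. The sketch below is illustrative only: to stay self-contained it rebuilds a small DataFrame from the ten rows shown by `df.head()` and `df.tail()`, and the variable names `sample`, `missing` and `counts` are assumptions of this example. In the notebook the same calls would be applied to `df`.

```python
import pandas as pd

# The ten rows shown by df.head() and df.tail() above, used as a small sample
sample = pd.DataFrame({
    "rownames": [1, 2, 3, 4, 5, 26, 27, 28, 29, 30],
    "weight": [4.17, 5.58, 5.18, 6.11, 4.50, 5.29, 4.92, 6.15, 5.80, 5.26],
    "group": ["ctrl"] * 5 + ["trt2"] * 5,
})

# Count missing values per column (all zeros when nothing is missing)
missing = sample.isna().sum()

# Count rows per condition
counts = sample["group"].value_counts()

# Optionally store the labels as an unordered (nominal) categorical
sample["group"] = sample["group"].astype("category")

print(missing)
print(counts)
print(sample.dtypes)
```

Converting `group` to a categorical is optional. It makes the nominal nature of the labels explicit without changing the values.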

In [7]:
df.describe()

| | rownames | weight |
|---|---|---|
| count | 30.0 | 30.0 |
| mean | 15.5 | 5.073 |
| std | 8.803408 | 0.701192 |
| min | 1.0 | 3.59 |
| 25% | 8.25 | 4.55 |
| 50% | 15.5 | 5.155 |
| 75% | 22.75 | 5.53 |
| max | 30.0 | 6.31 |
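The summary above pools all three conditions, and the `rownames` statistics only describe the row numbering. A per-condition summary of `weight` is usually more informative before a t-test or ANOVA. This is a minimal sketch, again using only the ten rows shown by `df.head()` and `df.tail()` as stand-in data (the name `sample` is an assumption of this example):

```python
import pandas as pd

# Stand-in data: the five ctrl and five trt2 rows shown in the outputs above
sample = pd.DataFrame({
    "weight": [4.17, 5.58, 5.18, 6.11, 4.50, 5.29, 4.92, 6.15, 5.80, 5.26],
    "group": ["ctrl"] * 5 + ["trt2"] * 5,
})

# Count, mean and sample standard deviation of weight within each condition
summary = sample.groupby("group")["weight"].agg(["count", "mean", "std"])
print(summary)
```

On the full dataset the equivalent call would be `df.groupby("group")["weight"].describe()`.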


3. Describe what a t-test is, how it works, and what the assumptions are



4. Perform a t-test to determine whether there is a significant difference between the two treatment groups `trt1` and `trt2`.

The two-sample t statistic, written here in its unequal-variance (Welch) form, is:
$$
t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
$$

where:
- $\bar{X}_1$ and $\bar{X}_2$ are the sample means,
- $s_1$ and $s_2$ are the sample standard deviations,
- $n_1$ and $n_2$ are the sample sizes.
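The formula can be checked against SciPy. The sketch below computes the statistic by hand with a hypothetical helper `welch_t` and compares it with `scipy.stats.ttest_ind` called with `equal_var=False`. The two input lists are illustrative only: they are the five `ctrl` and five `trt2` rows shown earlier, not the full `trt1` and `trt2` groups.

```python
import numpy as np
from scipy import stats


def welch_t(x1, x2):
    """t statistic from the formula above (sample variances use ddof=1)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # Standard error of the difference between the two sample means
    se = np.sqrt(x1.var(ddof=1) / x1.size + x2.var(ddof=1) / x2.size)
    return (x1.mean() - x2.mean()) / se


# Illustrative inputs only: rows shown by df.head() and df.tail()
a = [4.17, 5.58, 5.18, 6.11, 4.50]
b = [5.29, 4.92, 6.15, 5.80, 5.26]

t_manual = welch_t(a, b)
# equal_var=False selects the unequal-variance (Welch) test in SciPy
t_scipy, p_value = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, t_scipy, p_value)
```

For the project itself, the inputs would be the two treatment groups, for example `df.loc[df["group"] == "trt1", "weight"]` and `df.loc[df["group"] == "trt2", "weight"]`, and the resulting p-value would be compared with the chosen alpha. By default `ttest_ind` uses `equal_var=True`, the pooled-variance test. When both groups have the same size, the two forms give the same t statistic but different degrees of freedom.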



5. Perform ANOVA to determine whether there is a significant difference between the three treatment groups ctrl, trt1, and trt2.

https://www.questionpro.com/blog/anova-testing/#Types_of_ANOVA_testing

One-way ANOVA was performed on the three conditions using `scipy.stats.f_oneway` <a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html">(The SciPy community, 2014)</a>, which compares the variance between the condition means with the variance within each condition.
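To make the comparison of between-group and within-group variance concrete, the following sketch computes the F statistic by hand and checks it against `scipy.stats.f_oneway`. The helper `one_way_f` is hypothetical, and the three lists are made-up numbers chosen only to exercise the function. They are not the PlantGrowth measurements.

```python
import numpy as np
from scipy import stats


def one_way_f(*groups):
    """F statistic: between-group mean square over within-group mean square."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)                    # number of groups
    n = sum(g.size for g in groups)    # total number of observations
    grand_mean = np.concatenate(groups).mean()
    # Variation of the group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up numbers, for illustration only (not the PlantGrowth data)
g1 = [4.0, 5.0, 6.0, 5.0]
g2 = [5.0, 6.0, 7.0, 6.0]
g3 = [7.0, 8.0, 9.0, 8.0]

f_manual = one_way_f(g1, g2, g3)
f_scipy, p_value = stats.f_oneway(g1, g2, g3)
print(f_manual, f_scipy, p_value)
```

In the notebook the three inputs would be the `ctrl`, `trt1` and `trt2` weights, for example `[g["weight"].to_numpy() for _, g in df.groupby("group")]`.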

6. Explain why it is more appropriate to apply ANOVA rather than several t-tests when analyzing more than two groups.
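One common argument concerns the Type I error rate: every additional t-test is another chance of a false positive. The sketch below works through that arithmetic for three groups. It assumes alpha = 0.05 (a value chosen for this example) and treats the pairwise tests as independent, which is a simplification.

```python
from math import comb

alpha = 0.05      # significance level for each individual test (assumed)
k = 3             # number of groups: ctrl, trt1, trt2
m = comb(k, 2)    # number of pairwise t-tests needed to compare all groups

# Probability of at least one false positive across m tests,
# assuming (as a simplification) that the tests are independent
familywise = 1 - (1 - alpha) ** m
print(m, round(familywise, 4))
```

Under these assumptions, three pairwise tests raise the chance of at least one false positive well above the nominal 5%, whereas a single ANOVA tests all three groups at the stated alpha.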

## 3.0 Conclusion

## 4.0 References



***

## End of project.ipynb