# Data Analysis and Process

## Student t-test

A fundamental application of statistics: how do you know whether an assumption is valid, and how confident can you be in it?

In [None]:
from IPython.display import Image, IFrame

Image(url= "https://www.scribbr.com/wp-content/uploads/2022/06/t-table-interpretation-l.webp")

Example 1:

https://www.statology.org/when-to-reject-null-hypothesis/

A one sample t-test is used to test whether or not the mean of a population is equal to some value.

For example, suppose we want to know whether or not the mean weight of a certain species of turtle is equal to 310 pounds.

We go out and collect a simple random sample of 40 turtles with the following information:

Sample size n = 40

Sample mean weight x = 300

Sample standard deviation s = 18.5

We can use the following steps to perform a one sample t-test:

Hypotheses:

H0: μ = 310 (population mean is equal to 310 pounds)

HA: μ ≠ 310 (population mean is not equal to 310 pounds)


### 1. Compute the t-value

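Under H0, the sample mean $x$ is approximately normally distributed around $\mu$, with standard error $s/\sqrt{n}$:

$$
x \sim N\left(\mu, \frac{s}{\sqrt{n}}\right)
$$

The t-value measures how many standard errors the sample mean lies from the hypothesized mean:

$$t = \frac{x-{\mu}}{s/\sqrt{n}}$$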

In [None]:
import numpy as np

x = 300
mu = 310
s = 18.5
n = 40

t = (x - mu)/(s/np.sqrt(n))

table: https://www.medcalc.org/manual/t-distribution-table.php

In [None]:
degree_of_freedom = n - 1

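Say we want a 95% confidence level. The test is two-sided, so 0.025 of the probability lies in each tail. The critical value of the t-distribution with 39 degrees of freedom (written $z_{0.025, 39}$ in this notebook) is read from the table:

$${z_{0.025, 39}} = 2.023$$

Because $|t| \approx 3.42 > {z_{0.025, 39}}$, we reject the null hypothesis H0.

On the other hand, if $|t| < {z_{0.025, 39}}$, we would fail to reject the null hypothesis.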
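The decision rule above can be sketched in a few lines. This is a minimal illustration that reuses the turtle numbers and the table value 2.023; the variable names are chosen here for readability and are not part of the original notebook.

```python
import math

# Values from the turtle example
x_bar, mu_0, s, n = 300, 310, 18.5, 40

# t-statistic: distance between sample mean and hypothesized mean,
# measured in standard errors
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))

# Two-sided critical value for 95% confidence and 39 degrees of freedom,
# taken from the t-table
t_crit = 2.023

# Reject H0 when the t-statistic falls outside [-t_crit, +t_crit]
reject_h0 = abs(t_stat) > t_crit
print(f"t = {t_stat:.3f}, critical value = {t_crit}, reject H0: {reject_h0}")
```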

### Confidence Interval

Take, for example, a two-sided 95% confidence interval around the sample mean:

$$\text{lower limit} = x - {z_{0.025, 39}} \cdot \frac{s}{\sqrt{n}} $$

$$\text{upper limit} = x + {z_{0.025, 39}} \cdot \frac{s}{\sqrt{n}} $$

https://nulib.github.io/moderndive_book/ismaykimkuyper_files/figure-html/N-CIs-1.png

In [None]:
s_n = s/np.sqrt(n)

In [None]:
# 2.023 is the two-sided 95% critical value for 39 degrees of freedom
lower_limit = 300 - 2.023 * s_n

upper_limit = 300 + 2.023 * s_n

In [None]:
s_n

### If we reduce the number of samples, the uncertainty increases

In [None]:
x = 300
mu = 310
s = 18.5
n = 10

In [None]:
t = (x - mu)/(s/np.sqrt(n))

s_n = s/np.sqrt(n)

In [None]:
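# Two-sided 95% critical value for n - 1 = 9 degrees of freedom (from the table)
z = 2.262

# Standard error s / sqrt(n), rounded to two decimals
s_n = 5.85

# The resulting interval is wider than the one obtained with n = 40
upper_limit = 300 + z * s_n
lower_limit = 300 - z * s_n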

- Looking up the critical value in a table is time-consuming, and the table may not list the exact confidence level or degrees of freedom we need. Can we automate this step with a program?

### scipy.stats

In [None]:
IFrame(src="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html", height=500, width=800)

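`stats.t.ppf(q, df)` returns the value below which a fraction $q$ of the t-distribution lies. For a two-sided test, use either the lower-tail or the upper-tail quantile:

$$
q_{\text{lower}} = \frac{1 - \text{confidence}}{2} \quad \text{or} \quad q_{\text{upper}} = 1 - \frac{1 - \text{confidence}}{2} = \frac{1 + \text{confidence}}{2}
$$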

In [None]:
from scipy import stats

stats.t.ppf(0.025, 39)
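
As a small sketch of how the two quantiles relate: the lower-tail call returns a negative number, and the upper-tail call returns the same magnitude with a positive sign. The example also uses `stats.t.interval` to build the interval in one call; the confidence level is passed positionally because its keyword name has differed between SciPy versions.

```python
import numpy as np
from scipy import stats

confidence = 0.95
df = 39

q_lower = (1 - confidence) / 2   # 0.025
q_upper = (1 + confidence) / 2   # 0.975

t_lower = stats.t.ppf(q_lower, df)   # negative critical value
t_upper = stats.t.ppf(q_upper, df)   # positive critical value
print(t_lower, t_upper)

# Confidence interval around the sample mean of the turtle example
s_n = 18.5 / np.sqrt(40)
ci_low, ci_high = stats.t.interval(confidence, df, loc=300, scale=s_n)
print(ci_low, ci_high)
```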

In [None]:
def null_hypothesis_fn(mean_sample, mean_hypothesis, std, sample_size, confidence):

    """
    Performs a one-sample t-test to determine whether to reject the null hypothesis.

    Parameters:
    mean_sample (float): The mean of the sample data.
    mean_hypothesis (float): The hypothesized population mean.
    std (float): The standard deviation of the sample data.
    sample_size (int): The number of samples.
    confidence (float): The confidence level for the test (e.g., 0.95 for 95% confidence).

    Returns:
    tuple: A tuple containing:
        - reject (bool): Whether the null hypothesis is rejected.
        - lower_limit (float): The lower limit of the confidence interval.
        - upper_limit (float): The upper limit of the confidence interval.
        - t (float): The t-statistic calculated for the test.
        - z_value (float): The critical t-value for the given confidence level and degrees of freedom.
    """

    # Calculate the standard error of the mean
    s_n = std/np.sqrt(sample_size)

    # Calculate the t-statistic
    t = (mean_sample - mean_hypothesis)/(s_n)

    # Calculate the critical t-value for the given confidence level
    q = (1+confidence)/2
    df = sample_size - 1
    z_value = stats.t.ppf(q, df)

    # Determine whether to reject the null hypothesis
    reject = bool(abs(t) > z_value)

    # Calculate the confidence interval around the sample mean
    upper_limit = mean_sample + z_value * s_n
    lower_limit = mean_sample - z_value * s_n

    return reject, lower_limit, upper_limit, t, z_value
    

In [None]:
null_hypothesis_fn(mean_sample=300, mean_hypothesis=310, std=18.5, sample_size=40, confidence=0.95)
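
An equivalent way to reach the same decision is to compare a two-sided p-value with the significance level `1 - confidence`. The sketch below is an addition to the notebook, not part of the original function; it assumes the same turtle numbers and uses `stats.t.sf`, the upper-tail probability of the t-distribution.

```python
import numpy as np
from scipy import stats

x_bar, mu_0, s, n = 300, 310, 18.5, 40
confidence = 0.95

t_stat = (x_bar - mu_0) / (s / np.sqrt(n))
df = n - 1

# Two-sided p-value: probability of observing |t| at least this large
# if the null hypothesis were true
p_value = 2 * stats.t.sf(abs(t_stat), df)

alpha = 1 - confidence
reject_h0 = p_value < alpha
print(p_value, reject_h0)
```

Rejecting when `p_value < alpha` gives the same answer as rejecting when `abs(t)` exceeds the critical value.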

- How do we calculate statistics such as the mean and standard deviation, and determine the sample size, when the data comes from a file?

### Pandas

In [None]:
import pandas as pd

temperature_df = pd.read_excel("Data.xlsx", sheet_name='Temperature', index_col=0)

In [None]:
temperature_df

#### mean

- After executing this line, you'll obtain a Series where each element corresponds to the mean temperature of a city across all months.

In [None]:
temperature_df.mean(axis=1)

- After executing this line, you'll obtain a Series where each element corresponds to the mean temperature of a specific month averaged across all cities.

In [None]:
temperature_df.mean(axis=0)
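
To make the difference between the two axes concrete, here is a tiny sketch with a made-up table. The city names and values are invented for illustration and are not taken from `Data.xlsx`; it assumes the same layout as above, with one row per city and one column per month.

```python
import pandas as pd

# Made-up values for illustration only
toy_df = pd.DataFrame(
    {"Jan": [10.0, 20.0], "Feb": [14.0, 22.0]},
    index=["CityA", "CityB"],
)

# axis=1 averages across the columns: one value per row (city)
row_means = toy_df.mean(axis=1)

# axis=0 averages down the rows: one value per column (month)
col_means = toy_df.mean(axis=0)

print(row_means)
print(col_means)
```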

#### sort

In [None]:
temperature_df.mean(axis=1).sort_values(ascending=False).index[0]

In [None]:
temperature_df.mean(axis=1).sort_values(ascending=True).index[0]
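
Sorting and taking the first index label works, but when only the extreme label is needed, `idxmax` and `idxmin` return it directly. A minimal sketch on a made-up Series (the labels and values are hypothetical):

```python
import pandas as pd

# Made-up mean temperatures, for illustration only
toy_means = pd.Series({"CityA": 12.0, "CityB": 21.0, "CityC": 17.5})

warmest = toy_means.idxmax()   # label of the largest value
coldest = toy_means.idxmin()   # label of the smallest value

# Same answers as sorting and taking the first index label
warmest_sorted = toy_means.sort_values(ascending=False).index[0]
coldest_sorted = toy_means.sort_values(ascending=True).index[0]
print(warmest, coldest)
```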

#### std

In [None]:
temperature_df.std(axis=1)

#### count

In [None]:
num_rows = temperature_df.shape[0]
num_cols = temperature_df.shape[1]
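
These three quantities (mean, standard deviation, count) are exactly the inputs of the one-sample t-test. The sketch below ties them together on a hypothetical sample; the measurements are invented for illustration. Note that pandas' `std` uses `ddof=1` by default, which is the sample standard deviation the t-test expects, and `count` ignores missing values.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical measurements, invented for illustration only
weights = pd.Series([298.0, 305.0, 310.0, 291.0, 302.0, 296.0])
mu_0 = 310  # hypothesized population mean

n = weights.count()      # number of non-missing observations
x_bar = weights.mean()   # sample mean
s = weights.std()        # sample standard deviation (ddof=1 by default)

# t-statistic computed by hand, as in the earlier sections
t_manual = (x_bar - mu_0) / (s / np.sqrt(n))

# Cross-check against SciPy's built-in one-sample t-test
result = stats.ttest_1samp(weights, popmean=mu_0)
print(t_manual, result.statistic, result.pvalue)
```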

### Requests

- URL Assignment: After the imports, the code assigns a web address (URL) to a variable named url. This URL points to a text file hosted on the internet.

- Request to Retrieve Data: The next line uses Python's requests library to send a GET request to that URL and retrieve the contents of the text file (index.txt). The allow_redirects=True parameter lets the request follow any redirects the server returns.

In [None]:
import requests

import pandas as pd

url="https://online.stat.psu.edu/stat462/sites/onlinecourses.science.psu.edu.stat462/files/data/skincancer/index.txt"
r = requests.get(url, allow_redirects=True)

Saving Data Locally: The with open('index.txt', 'wb') as f: part begins a block of code that opens a file named index.txt in write mode ('wb' means write binary). Inside this block, the retrieved content (r.content) from the web request is written into this local file (index.txt). This effectively saves a copy of the text file from the internet onto your computer.

In [None]:
with open('index.txt', 'wb') as f:
    f.write(r.content)

# sep=r"\s+" splits columns on any run of whitespace
# (delim_whitespace=True is deprecated in recent pandas versions)
c = pd.read_csv('index.txt', sep=r"\s+")

- Loading Data into Pandas DataFrame: The last line of code uses the pandas library (pd) to read the contents of index.txt into a DataFrame (c).

- The read_csv function of pandas is used here with the following parameters:
  - 'index.txt': the name of the file to read.
  - sep=r"\s+": tells pandas to treat any run of whitespace (spaces, tabs) as the delimiter between columns, instead of the comma that is the default for CSV files. It has the same effect as the older delim_whitespace=True.
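
As a possible variation, the intermediate file can be skipped by parsing the downloaded text in memory. The sketch below is an assumption-laden alternative, not the notebook's method: the helper names `parse_whitespace_table` and `fetch_table` are hypothetical, and the sample text is made up so the parsing step can be tried without a network connection. `raise_for_status` makes a failed download raise an error instead of silently parsing an error page.

```python
import io

import pandas as pd
import requests


def parse_whitespace_table(text):
    """Parse whitespace-delimited text into a DataFrame (hypothetical helper)."""
    return pd.read_csv(io.StringIO(text), sep=r"\s+")


def fetch_table(url, timeout=10):
    """Download a whitespace-delimited text file and parse it in memory."""
    r = requests.get(url, allow_redirects=True, timeout=timeout)
    r.raise_for_status()  # raise an exception on HTTP error codes
    return parse_whitespace_table(r.text)


# Made-up sample text, used only to demonstrate the parsing step offline
sample_text = "name value\nA 1.5\nB 2.0\n"
sample_df = parse_whitespace_table(sample_text)
print(sample_df)
```

Calling `fetch_table(url)` with the URL defined earlier would return the same kind of DataFrame as `c`, assuming the server is reachable.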

### Summary in Simple Terms:

- The code fetches data from a specific web address that hosts a text file (index.txt).
- It then saves a copy of this text file onto your computer.
- Finally, it loads the data from this saved file into a format that allows easy manipulation and analysis, represented as a table-like structure called a DataFrame.