# Data Analysis and Process

## Student t-test

A fundamental application of statistics: how do you know whether an assumption is valid, and how confident can you be in it?

In [None]:
from IPython.display import Image, IFrame

Image(url= "https://www.scribbr.com/wp-content/uploads/2022/06/t-table-interpretation-l.webp")

Example 1:

https://www.statology.org/when-to-reject-null-hypothesis/

A one sample t-test is used to test whether or not the mean of a population is equal to some value.

For example, suppose we want to know whether or not the mean weight of a certain species of turtle is equal to 310 pounds.

We go out and collect a simple random sample of 40 turtles with the following information:

Sample size n = 40

Sample mean weight x = 300

Sample standard deviation s = 18.5

We can use the following steps to perform a one sample t-test:

Hypotheses:

H0: μ = 310 (population mean is equal to 310 pounds)

HA: μ ≠ 310 (population mean is not equal to 310 pounds)


### 1. Compute the t-value

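Under H0, the sample mean $x$ is approximately normally distributed around $\mu$ with standard error $s/\sqrt{n}$, and the standardized statistic follows a t-distribution with $n - 1$ degrees of freedom:

$$
x \sim N\left(\mu, \frac{s}{\sqrt{n}}\right), \qquad t = \frac{x - \mu}{s / \sqrt{n}}
$$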

In [None]:
import numpy as np

x = 300
mu = 310
s = 18.5
n = 40

# Despite the name, this is the t-statistic: it uses the sample standard deviation s
z_score = (x - mu)/(s/np.sqrt(n))
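
As a rough check on the cell above, the arithmetic can be worked by hand with the sample values (intermediate results rounded):

$$
t = \frac{x - \mu}{s / \sqrt{n}} = \frac{300 - 310}{18.5 / \sqrt{40}} \approx \frac{-10}{2.925} \approx -3.42
$$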

t-distribution table: https://www.medcalc.org/manual/t-distribution-table.php

In [None]:
degree_of_freedom = n - 1

Suppose we want a two-sided test at the 95% confidence level, so each tail has probability 0.025:

$${t_{0.025, 39}} = 2.023$$

Because $|t| \approx 3.42 > t_{0.025, 39} = 2.023$ (where $t$ is `z_score` in the code), we reject the null hypothesis H0.

Conversely, if $|t| < t_{0.025, 39}$, we would fail to reject the null hypothesis.

### Confidence Interval

Choose, for example, a two-sided 95% confidence interval:

$$\text{lower limit} = x - {t_{0.025, 39}} * \frac{s}{\sqrt{n}} $$

$$\text{upper limit} = x + {t_{0.025, 39}} * \frac{s}{\sqrt{n}} $$

https://nulib.github.io/moderndive_book/ismaykimkuyper_files/figure-html/N-CIs-1.png

In [None]:
s_n = s/np.sqrt(n)

In [None]:
# 2.023 is the two-sided 95% critical value for 39 degrees of freedom
lower_limit = 300 - 2.023 * s_n

upper_limit = 300 + 2.023 * s_n

In [None]:
s_n

### If we reduce the number of samples, the uncertainty increases

In [None]:
x = 300
mu = 310
s = 18.5
n = 10

In [None]:
z_score = (x - mu)/(s/np.sqrt(n))

s_n = s/np.sqrt(n)

In [None]:
t = 2.262  # two-sided 95% critical value for 9 degrees of freedom

s_n = 5.85  # rounded value of s/np.sqrt(n) for n = 10

upper_limit = 300 + t * s_n
lower_limit = 300 - t * s_n
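
To make the effect of the sample size concrete, here is a minimal sketch that compares the margin of error (critical value times standard error) for n = 40 and n = 10. It reuses the table values quoted in this section (2.023 and 2.262); the names `cases` and `margins` are only illustrative.

```python
import math

s = 18.5  # sample standard deviation from the turtle example

# (sample size, two-sided 95% critical t-value read from the t-table)
cases = [(40, 2.023), (10, 2.262)]

margins = {}
for n, t_crit in cases:
    standard_error = s / math.sqrt(n)
    margins[n] = t_crit * standard_error
    print(f"n = {n}: standard error = {standard_error:.3f}, margin = {margins[n]:.3f}")
```

The smaller sample is penalized twice: the standard error grows and the critical t-value is larger, so the interval widens on both counts.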

- Looking up the critical t-value in a table is time-consuming, and the table may not list the exact value we need. Can we automate this lookup with a program?

### scipy.stats

In [None]:
IFrame(src="https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html", height=500, width=800)

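`stats.t.ppf(q, df)` returns the t-value below which a fraction $q$ of the distribution lies. For a two-sided test, use either the lower-tail or the upper-tail quantile:

$$
q_{\text{lower}} = \frac{1 - \text{confidence}}{2} \quad \text{or} \quad q_{\text{upper}} = 1 - \frac{1 - \text{confidence}}{2} = \frac{1 + \text{confidence}}{2}
$$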
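
As a small illustration of the two quantiles, the sketch below evaluates both for the turtle example (95% confidence, 39 degrees of freedom). The variable names are illustrative; by symmetry of the t-distribution the two results should differ only in sign.

```python
from scipy import stats

confidence = 0.95
df = 39  # degrees of freedom for n = 40

q_lower = (1 - confidence) / 2  # 0.025, lower tail
q_upper = (1 + confidence) / 2  # 0.975, upper tail

t_lower = stats.t.ppf(q_lower, df)
t_upper = stats.t.ppf(q_upper, df)
print(t_lower, t_upper)
```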

In [None]:
from scipy import stats

# Lower-tail critical value for q = 0.025 with 24 degrees of freedom (negative by symmetry)
stats.t.ppf(0.025, 24)

In [None]:
def null_hypothesis_fn(mean_sample, mean_hypothesis, std, sample_size, confidence):

    """
    Performs a one-sample t-test to determine whether to reject the null hypothesis.

    Parameters:
    mean_sample (float): The mean of the sample data.
    mean_hypothesis (float): The hypothesized population mean.
    std (float): The standard deviation of the sample data.
    sample_size (int): The number of samples.
    confidence (float): The confidence level for the test (e.g., 0.95 for 95% confidence).

    Returns:
    tuple: A tuple containing:
        - reject (bool): Whether the null hypothesis is rejected.
        - lower_limit (float): The lower limit of the confidence interval.
        - upper_limit (float): The upper limit of the confidence interval.
        - z_score (float): The t-statistic calculated for the test.
        - t (float): The critical t-value for the given confidence level and degrees of freedom.
    """

    # Calculate the standard error of the mean
    s_n = std/np.sqrt(sample_size)

    # Calculate the t-statistic
    z_score = (mean_sample - mean_hypothesis)/(s_n)

    # Calculate the critical t-value for the given confidence level
    q = (1+confidence)/2
    df = sample_size - 1
    t = stats.t.ppf(q, df)

    # Determine whether to reject the null hypothesis
    if abs(z_score) > t:
        reject = True
    else:
        reject = False

    # Calculate the confidence interval around the sample mean
    upper_limit = mean_sample + t * s_n
    lower_limit = mean_sample - t * s_n

    return reject, lower_limit, upper_limit, z_score, t
    

In [None]:
null_hypothesis_fn(mean_sample=300, mean_hypothesis=310, std=18.5, sample_size=40, confidence=0.95)
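
When raw observations are available rather than only summary statistics, the manual calculation can be cross-checked against `scipy.stats.ttest_1samp`. The sketch below uses synthetic data generated purely for illustration (these are not real turtle measurements), and the seed and variable names are assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic data, generated only for illustration (not the turtle measurements)
rng = np.random.default_rng(0)
sample = rng.normal(loc=300, scale=18.5, size=40)

mean_hypothesis = 310
n = sample.size
standard_error = sample.std(ddof=1) / np.sqrt(n)  # ddof=1 gives the sample standard deviation

# Manual t-statistic and two-sided p-value
t_manual = (sample.mean() - mean_hypothesis) / standard_error
p_manual = 2 * stats.t.sf(abs(t_manual), n - 1)

# The same test performed by SciPy
result = stats.ttest_1samp(sample, popmean=mean_hypothesis)

print(t_manual, result.statistic)
print(p_manual, result.pvalue)
```

Rejecting H0 when the p-value is below 1 − confidence is equivalent to comparing the absolute t-statistic with the critical value.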

- How do we calculate statistics such as the mean and standard deviation, and determine the sample size, when the data come from a file?

### Pandas

In [None]:
import pandas as pd

temperature_df = pd.read_excel("Data.xlsx", sheet_name='Temperature', index_col=0)

In [None]:
temperature_df

#### mean

- After executing this line, you'll obtain a Series where each element corresponds to the mean temperature of a city across all months.

In [None]:
temperature_df.mean(axis=1)

- After executing this line, you'll obtain a Series where each element corresponds to the mean temperature of a specific month averaged across all cities.

In [None]:
temperature_df.mean(axis=0)
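
The `axis` argument is easy to mix up, so here is a tiny sketch on a hypothetical two-city, two-month frame (`df_demo` is a made-up stand-in, not the contents of `Data.xlsx`).

```python
import pandas as pd

# Hypothetical data: rows are cities, columns are months
df_demo = pd.DataFrame({"Jan": [10, 20], "Feb": [30, 40]}, index=["City A", "City B"])

row_means = df_demo.mean(axis=1)  # one value per row (city), averaged over the columns
col_means = df_demo.mean(axis=0)  # one value per column (month), averaged over the rows

print(row_means)
print(col_means)
```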

#### sort

In [None]:
temperature_df.mean(axis=1).sort_values(ascending=False).index[0]

In [None]:
temperature_df.mean(axis=1).sort_values(ascending=True).index[0]
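
If only the label of the largest or smallest value is needed, `idxmax` and `idxmin` give the same answer without sorting the whole Series. A minimal sketch on hypothetical city means:

```python
import pandas as pd

# Hypothetical mean temperatures per city
city_means = pd.Series({"City A": 12.5, "City B": 21.0, "City C": 17.3})

warmest = city_means.idxmax()  # label of the largest value
coldest = city_means.idxmin()  # label of the smallest value

# Same labels as the sort-then-take-first approach
sorted_warmest = city_means.sort_values(ascending=False).index[0]
sorted_coldest = city_means.sort_values(ascending=True).index[0]
print(warmest, coldest)
```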

#### std

In [None]:
temperature_df.std(axis=1)

#### count

In [None]:
num_rows = temperature_df.shape[0]
num_cols = temperature_df.shape[1]
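
These pieces can be combined to get, for each city, the three inputs a one-sample t-test needs: mean, standard deviation, and sample size. The sketch below assumes a layout with cities as rows and months as columns; the data are hypothetical and only stand in for the `Temperature` sheet.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the 'Temperature' sheet: rows are cities, columns are months
temperature_demo = pd.DataFrame(
    {"Jan": [5.0, 18.0], "Feb": [7.0, 19.0], "Mar": [11.0, 21.0], "Apr": [15.0, 24.0]},
    index=["City A", "City B"],
)

summary = pd.DataFrame({
    "mean": temperature_demo.mean(axis=1),
    "std": temperature_demo.std(axis=1),   # pandas uses ddof=1 (sample std) by default
    "n": temperature_demo.count(axis=1),   # number of non-missing months per city
})

# Half-width of a two-sided 95% confidence interval for each city's mean
t_crit = stats.t.ppf(0.975, (summary["n"] - 1).to_numpy())
summary["margin"] = t_crit * summary["std"] / np.sqrt(summary["n"])
print(summary)
```

Note that `count` skips missing values, whereas `shape` counts every cell, so `count` is the safer choice for n when the file has gaps.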

#### filter

In [None]:
temperature_df.filter(like='J')
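
For a DataFrame, `filter` selects by label (column labels by default), not by cell values. The sketch below shows the three selection modes on a hypothetical one-row frame.

```python
import pandas as pd

# Hypothetical frame with month columns
df_demo = pd.DataFrame({"Jan": [1], "Feb": [2], "Jun": [3], "Jul": [4]}, index=["City A"])

contains_j = df_demo.filter(like="J")               # labels containing the substring "J"
starts_ju = df_demo.filter(regex="^Ju")             # labels matching a regular expression
explicit = df_demo.filter(items=["Jan", "Feb"])     # an explicit list of labels

print(list(contains_j.columns), list(starts_ju.columns), list(explicit.columns))
```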

#### apply

- Apply a function along an axis of the DataFrame.

In [None]:
temperature_df['Jan'].apply(lambda x: x + 10)

In [None]:
def fn(x):
    
    return x + 10

In [None]:
# A named function can be passed directly; no lambda wrapper is needed
temperature_df['Jan'].apply(fn)

In [None]:
def fn_2(x):
    
    return x + 10, x + 20

In [None]:
# Use .iloc[0] for positional access; x[0] on a label-indexed row is deprecated
temperature_df[['Jan']].apply(lambda x: fn_2(x.iloc[0]), axis=1, result_type='expand')
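
For simple arithmetic like this, plain column operations usually give the same result as a row-wise `apply` and tend to be faster. A minimal sketch comparing the two on hypothetical data (`shift_twice` and the column names are made up for illustration):

```python
import pandas as pd

# Hypothetical January temperatures
df_demo = pd.DataFrame({"Jan": [5.0, 18.0]}, index=["City A", "City B"])

def shift_twice(value):
    """Return the value shifted by 10 and by 20."""
    return value + 10, value + 20

# Row-wise apply: each returned tuple is expanded into two columns
expanded = df_demo.apply(lambda row: shift_twice(row["Jan"]), axis=1, result_type="expand")
expanded.columns = ["plus_10", "plus_20"]

# Vectorized equivalent: operate on the whole column at once
vectorized = pd.DataFrame({"plus_10": df_demo["Jan"] + 10, "plus_20": df_demo["Jan"] + 20})

print(expanded)
```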

#### Merge

- The merge function in Pandas is similar to SQL joins. It combines two DataFrames based on one or more keys.
- The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.

In [None]:
# Create two DataFrames
df1 = pd.DataFrame({
    'employee_id': [1, 2, 3, 4],
    'name': ['John', 'Anna', 'Peter', 'Linda']
})

df2 = pd.DataFrame({
    'employee_id': [3, 4, 5, 6],
    'department': ['HR', 'Finance', 'IT', 'Operations']
})

# Merge the DataFrames on 'employee_id'
merged_df = pd.merge(df1, df2, on='employee_id', how='inner')
print(merged_df)
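
The inner join above keeps only employees 3 and 4, who appear in both frames. As a sketch of the other join types, the example below rebuilds the same two frames and uses `how='outer'` with `indicator=True` to show where each row came from, plus a left join for comparison.

```python
import pandas as pd

df1 = pd.DataFrame({"employee_id": [1, 2, 3, 4],
                    "name": ["John", "Anna", "Peter", "Linda"]})
df2 = pd.DataFrame({"employee_id": [3, 4, 5, 6],
                    "department": ["HR", "Finance", "IT", "Operations"]})

# Outer join keeps every key from both sides; _merge records the origin of each row
outer_df = pd.merge(df1, df2, on="employee_id", how="outer", indicator=True)

# Left join keeps every row of df1; unmatched departments become NaN
left_df = pd.merge(df1, df2, on="employee_id", how="left")

print(outer_df)
print(left_df)
```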

#### Concatenate

- The `pd.concat` function appends DataFrames along a particular axis (either rows or columns).

In [None]:
# Create two DataFrames
df3 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3']
})

df4 = pd.DataFrame({
    'A': ['A4', 'A5', 'A6', 'A7'],
    'B': ['B4', 'B5', 'B6', 'B7']
})

# Concatenate DataFrames along rows
concat_df = pd.concat([df3, df4], axis=0)
print(concat_df)
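
Row-wise concatenation keeps each frame's original index, so the result above has the labels 0 to 3 twice. The sketch below shows two common variations on the same frames: `ignore_index=True` to renumber the rows, and `axis=1` to place the frames side by side (the `_right` suffix is only there to keep column names unique).

```python
import pandas as pd

df3 = pd.DataFrame({"A": ["A0", "A1", "A2", "A3"], "B": ["B0", "B1", "B2", "B3"]})
df4 = pd.DataFrame({"A": ["A4", "A5", "A6", "A7"], "B": ["B4", "B5", "B6", "B7"]})

# Stack rows and renumber the index 0..7
stacked = pd.concat([df3, df4], axis=0, ignore_index=True)

# Place the frames side by side, aligned on the index
side_by_side = pd.concat([df3, df4.add_suffix("_right")], axis=1)

print(stacked)
print(side_by_side)
```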

#### Join

- The join function in Pandas is used to combine two DataFrames on the index or on a key column.

In [None]:
# Create two DataFrames
df5 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
}, index=['K0', 'K1', 'K2'])

df6 = pd.DataFrame({
    'C': ['C0', 'C1', 'C2'],
    'D': ['D0', 'D1', 'D2']
}, index=['K0', 'K2', 'K3'])

# Join the DataFrames
joined_df = df5.join(df6, how='inner')
print(joined_df)

#### Difference between `merge` and `join`

- merge
    - Purpose: The merge method is designed to combine DataFrames based on one or more keys (columns) that can be specified.
    - Usage: merge is very flexible and allows for different types of joins: inner, outer, left, and right joins.
    - Syntax: pd.merge(left, right, on='key', how='inner')
    - Default Join: The default join type is an inner join.
    - Column-based: Primarily used to join on columns.
    
- join
    - Purpose: The join method is mainly used to combine DataFrames based on their indices.
    - Usage: join is convenient when you want to combine DataFrames on their index or a key column.
    - Syntax: df1.join(df2, how='left')
    - Default Join: The default join type is a left join.
    - Index-based: Primarily used to join on indices but can also join on a key column.
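
To illustrate the relationship between the two, the sketch below joins two small hypothetical frames on their index with `join`, then reproduces the result with `merge` by passing `left_index=True` and `right_index=True`. It also shows that `join` defaults to a left join.

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1", "A2"]}, index=["K0", "K1", "K2"])
right = pd.DataFrame({"C": ["C0", "C2", "C3"]}, index=["K0", "K2", "K3"])

# Index-based inner join
joined = left.join(right, how="inner")

# Equivalent merge, told explicitly to use both indexes as keys
merged = pd.merge(left, right, left_index=True, right_index=True, how="inner")

# With no `how`, join keeps every row of the calling frame (left join)
default_join = left.join(right)

print(joined)
print(default_join)
```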

#### pivot table

In [None]:
data = {
    'Date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03', '2024-01-03'],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles', 'New York', 'Los Angeles'],
    'Sales': [250, 200, 300, 220, 310, 210],
    'Expenses': [150, 180, 190, 210, 160, 200]
}

df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])

print(df)

Let's create a pivot table to summarize the total sales and expenses by city and date.

In [None]:
pivot_table = pd.pivot_table(df, 
                             values=['Sales', 'Expenses'], 
                             index=['Date'], 
                             columns=['City'], 
                             aggfunc='sum')

print(pivot_table)
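
A pivot table with `aggfunc='sum'` is closely related to a `groupby` followed by `unstack`. The sketch below rebuilds the sales data (dates kept as strings for brevity, and only the `Sales` column) and compares the two approaches.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02", "2024-01-03", "2024-01-03"],
    "City": ["New York", "Los Angeles", "New York", "Los Angeles", "New York", "Los Angeles"],
    "Sales": [250, 200, 300, 220, 310, 210],
})

# Pivot: one row per date, one column per city
pivot = pd.pivot_table(df, values="Sales", index="Date", columns="City", aggfunc="sum")

# Same summary built from groupby, then moving City from the index to the columns
grouped = df.groupby(["Date", "City"])["Sales"].sum().unstack("City")

print(pivot)
```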