<a href="https://colab.research.google.com/github/luuloi/GWAS_Introduction_2023/blob/main/03_Statistics_practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Understanding

The dataset below describes student performance and has the following columns:
* StudentID: An identifier for each student.
* Gender: Categorical variable, either Male or Female.
* Math_Score: Math score out of 100.
* Reading_Score: Reading score out of 100.
* Lunch_Type: Type of lunch the student usually has, either Standard or Free/Reduced.

In [1]:
# Setting a seed for reproducibility
set.seed(123)

# Generating a sample dataframe
sample_df <- data.frame(
  StudentID = 1:100,
  Gender = sample(c("Male", "Female"), 100, replace = TRUE),
  Math_Score = round(runif(100, 50, 100)),  # Random scores between 50 and 100
  Reading_Score = round(runif(100, 50, 100)),  # Random scores between 50 and 100
  Lunch_Type = sample(c("Standard", "Free/Reduced"), 100, replace = TRUE)
)

| | StudentID | Gender | Math_Score | Reading_Score | Lunch_Type |
|---|---|---|---|---|---|
| | `<int>` | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` |
| 1 | 1 | Male | 80 | 62 | Free/Reduced |
| 2 | 2 | Male | 67 | 98 | Free/Reduced |
| 3 | 3 | Male | 74 | 80 | Standard |
| 4 | 4 | Female | 98 | 76 | Free/Reduced |
| 5 | 5 | Male | 74 | 70 | Standard |
| 6 | 6 | Female | 95 | 94 | Standard |


1. View the first few rows

In [None]:
head(sample_df)

2. Summarize the data frame

In [None]:
summary(sample_df)

3. Measures of Central Tendency
* Mean: Represents the average.
* Median: The middle value when data is arranged in ascending order.
* Mode: Most frequently occurring value (Note: R's built-in `mode()` returns an object's storage type, not the statistical mode, so we define our own function).

In [None]:
# Mean of Math_Score (given as the example) and Reading_Score
mean_math <- mean(sample_df$Math_Score)
mean_reading <- mean(sample_df$Reading_Score)

# Median of Math_Score and Reading_Score
median_math <- median(sample_df$Math_Score)
median_reading <- median(sample_df$Reading_Score)

# Mode function
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Get the mode of Gender and Lunch_Type
mode_gender <- getmode(sample_df$Gender)
mode_lunch <- getmode(sample_df$Lunch_Type)
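
As a quick check of how `getmode` behaves, here is a minimal sketch on two small hypothetical vectors (`scores` and `tied` are illustrative names, not part of the dataset). Because `which.max` returns the position of the first maximum, a tie is resolved in favor of the value that appears first in the input. That is a property of this helper, not a general definition of the mode.

```r
# Same helper as above, repeated so this block runs on its own
getmode <- function(v) {
  uniqv <- unique(v)                              # distinct values, in order of first appearance
  uniqv[which.max(tabulate(match(v, uniqv)))]     # value with the highest count
}

scores <- c(70, 85, 85, 90)      # hypothetical data: 85 occurs twice
getmode(scores)

tied <- c("b", "a", "b", "a")    # hypothetical data: "a" and "b" both occur twice
getmode(tied)                    # the tie goes to "b", the value seen first
```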

4. Measures of Dispersion
* Range: Difference between the maximum and minimum values.
* Quartiles: Values that divide the sorted data into four equal parts.
* Interquartile Range (IQR): Difference between the third and first quartiles; it describes the spread of the middle 50% of values.
* Variance: Average squared deviation from the mean (R's `var()` uses the sample formula, dividing by n - 1).
* Standard Deviation: Square root of the variance; the typical distance of a data point from the mean, in the same units as the data.

In [None]:
# Range of Math_Score and Reading_Score
# Note: range() returns c(min, max); use diff(range(x)) for the difference
range_math <- range(sample_df$Math_Score)
range_reading <- range(sample_df$Reading_Score)

# Quartiles of Math_Score and Reading_Score
quartiles_math <- quantile(sample_df$Math_Score)
quartiles_reading <- quantile(sample_df$Reading_Score)

# IQR of Math_Score and Reading_Score
iqr_math <- IQR(sample_df$Math_Score)
iqr_reading <- IQR(sample_df$Reading_Score)

# Variance of Math_Score and Reading_Score
var_math <- var(sample_df$Math_Score)
var_reading <- var(sample_df$Reading_Score)

# Standard Deviation of Math_Score and Reading_Score
sd_math <- sd(sample_df$Math_Score)
sd_reading <- sd(sample_df$Reading_Score)
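
The definitions listed for the dispersion measures can be checked against R's built-in functions. The sketch below uses a small hypothetical vector `x` (an assumption for illustration, not drawn from `sample_df`) and recomputes each measure by hand.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical scores

# Range as a single number: maximum minus minimum
spread <- diff(range(x))

# Quartiles: quantile() returns the 0%, 25%, 50%, 75% and 100% points
q <- quantile(x)

# IQR by hand: third quartile minus first quartile
iqr_manual <- unname(q["75%"] - q["25%"])

# Sample variance by hand: squared deviations divided by n - 1
var_manual <- sum((x - mean(x))^2) / (length(x) - 1)

# Each comparison should report TRUE
all.equal(iqr_manual, IQR(x))
all.equal(var_manual, var(x))
all.equal(sqrt(var(x)), sd(x))
```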

5. Frequency Tables: Describe how often each level of a categorical variable occurs

In [4]:
# Frequency tables for Gender and Lunch_Type
gender_freq <- table(sample_df$Gender)
lunch_freq <- table(sample_df$Lunch_Type)
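
Counts are often easier to compare as proportions. As a minimal sketch, assuming a small hypothetical vector `toy_lunch` (not part of the dataset), `prop.table()` converts a frequency table into proportions that sum to 1.

```r
# Hypothetical lunch types for five students
toy_lunch <- c("Standard", "Free/Reduced", "Standard", "Standard", "Free/Reduced")

toy_freq <- table(toy_lunch)        # counts per category
toy_prop <- prop.table(toy_freq)    # counts divided by the total

print(toy_freq)
print(round(toy_prop, 2))
```

The same call applied to `gender_freq` or `lunch_freq` would give the proportions for the full dataset.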


# Visualization

Using the sample dataset (sample_df), we will:
1. Create a histogram of Math_Score to understand its distribution.
2. Use ggplot2 to draw a grouped bar plot showing the relationship between the categorical variables Gender and Lunch_Type.

In [None]:
# 1. Histogram of Math_Score
hist(sample_df$Math_Score, breaks=20, main="Distribution of Math Scores", xlab="Math Score", col="lightgreen", border="black")

In [None]:
# 2. Grouped bar plot using ggplot2 for `Gender` and `Lunch_Type`
library(ggplot2)
ggplot(sample_df, aes(x=Gender, fill=Lunch_Type)) +
  geom_bar(position="dodge") +  # one bar per lunch type, side by side within each gender
  theme_minimal() +
  labs(title="Distribution of Gender by Lunch Type", y="Count", x="Gender", fill="Lunch Type")

# Statistical Testing

Using the sampled dataset (sample_df), we will:
1. Choose two categorical variables, in this case, Gender and Lunch_Type.
2. Create a contingency table from the chosen variables.
3. Perform both the Chi-square test and Fisher's Exact Test on this table and interpret the results.

In [None]:
# 1. Create a contingency table for `Gender` (rows) and `Lunch_Type` (columns)
cont_table <- table(sample_df$Gender, sample_df$Lunch_Type)

# Display the contingency table
print(cont_table)

In [None]:
# 2. Perform Chi-square test
chi_test_result <- chisq.test(cont_table)
print(chi_test_result)

In [None]:
# 3. Perform Fisher's Exact Test (preferred when any expected cell count is below 5)
fisher_test_result <- fisher.test(cont_table)
print(fisher_test_result)
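
To interpret the results, one common approach is to compare the p-value with a significance level chosen in advance and to check the expected counts that justify the Chi-square approximation. The sketch below uses a hypothetical 2x2 table `toy_table` and an assumed significance level `alpha` of 0.05; neither comes from the exercise data.

```r
# Hypothetical contingency table (counts are made up for illustration)
toy_table <- matrix(
  c(30, 20, 25, 25), nrow = 2,
  dimnames = list(Gender = c("Female", "Male"),
                  Lunch_Type = c("Free/Reduced", "Standard"))
)

chi <- chisq.test(toy_table)   # applies Yates' continuity correction for 2x2 tables by default
print(chi$expected)            # expected counts under independence

alpha <- 0.05                  # assumed significance level
if (chi$p.value < alpha) {
  message("Reject independence: the two variables appear associated.")
} else {
  message("No evidence against independence at this significance level.")
}

# Fisher's Exact Test on the same table, for comparison
print(fisher.test(toy_table)$p.value)
```

If any expected count were below 5, the Fisher result would be the one to report.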

# Advanced Visualization

Given this simulated GWAS data:

In [None]:
set.seed(123)
simulated_gwas <- data.frame(
  SNP = paste0("rs", 1:10000),
  CHR = sample(1:22, 10000, replace = TRUE),
  BP = sample(1:1e8, 10000),
  P = runif(10000, 0, 1)
)

1. Create a Manhattan plot for this simulated data

In [None]:
library(qqman)
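# Column names match qqman's defaults, written out here for clarity
manhattan(simulated_gwas, chr = "CHR", bp = "BP", p = "P", snp = "SNP")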

2. Highlight SNPs with p-values below 0.05

In [None]:
# Select SNPs with p-values below 0.05
significant_snps <- subset(simulated_gwas, P < 0.05)

# Highlight them by SNP name
manhattan(simulated_gwas, highlight = significant_snps$SNP)
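
Because the simulated p-values are drawn from a uniform distribution, about 5% of SNPs should fall below 0.05 by chance alone, so the highlighted set will be large. The sketch below is a sanity check under that assumption. It regenerates only a vector of p-values (`p_values` is an illustrative name) rather than the full data frame, so the exact values differ from those in `simulated_gwas`.

```r
set.seed(123)
p_values <- runif(10000, 0, 1)   # uniform p-values, as in the simulation

# Share of p-values below 0.05; expected to be close to 0.05
n_below <- sum(p_values < 0.05)
prop_below <- n_below / length(p_values)
print(prop_below)

# A Manhattan plot shows -log10(p) on the y-axis,
# so p = 0.05 corresponds to a height of about 1.3
cutoff_height <- -log10(0.05)
print(cutoff_height)
```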