
# MSS482 - GRAPHING TECHNOLOGY IN MATHEMATICS AND SCIENCE

**SEMESTER 1 2023/2024**


>R.U.Gobithaasan (2023). School of Mathematical Sciences, Universiti Sains Malaysia.
[Official Website](https://math.usm.my/academic-profile/705-gobithaasan-rudrusamy) 


<p align="center">
     © 2023 R.U. Gobithaasan All Rights Reserved.
</p>

# Analysing more than one variable
- https://www.pythonfordatascience.org/independent-samples-t-test-python/

3.1 Analyzing datasets with <br>
    a) Continuous features <br>
    b) Categorical features <br>

3.2. t-Test <br>
    a) Independent Samples t-test <br>
    b) Paired Samples t-test <br>


3.3 Correlation






### requirements

> Install the following: `!python -m pip install pandas`
1. pandas
2. researchpy
3. statsmodels
4. matplotlib
5. seaborn
6. scipy
7. numpy

### Dataset: Online Dataset sources

**Online Sources:** 
- Google Dataset Search: https://datasetsearch.research.google.com/ 
- Kaggle: https://www.kaggle.com/datasets 
- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php 
- Earth Data: https://www.earthdata.nasa.gov/
- Scikit Dataset: https://scikit-learn.org/stable/datasets.html
- https://github.com/gob1thaasan/Data-sets 


### Tips

In [None]:
# Magic command to display Matplotlib plots inline :https://ipython.readthedocs.io/en/stable/interactive/magics.html
%matplotlib inline
# Suppress warning messages so the notebook output stays uncluttered.
import warnings
warnings.filterwarnings("ignore")

# Analyzing Continuous & Categorical Dataset

## Numeric Data
> Numeric data is information that can be expressed as a number: Can be continuous or discrete.

Continuous data is data that can take any value. Height, weight, temperature and length are all examples of continuous data. Some continuous data will change over time; the weight of a baby in its first year or the temperature in a room throughout the day. This data is best shown on a line graph as this type of graph can show how the data changes over a given period of time. Other continuous data, such as the heights of a group of children on one particular day, is often grouped into categories to make it easier to interpret.

Discrete data is information that can only take certain values. These values don’t have to be whole numbers (shoe sizes such as 6.5 and 7 are discrete), but they are often counts, for example the number of hospital visits scheduled for the week or the number of students attending a lecture each day. This type of data is often represented using tally charts, bar charts or pie charts.


---
## Categorical Data
- https://www.datacamp.com/tutorial/categorical-data

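Categorical data is data that falls into a limited set of categories rather than being measured on a numeric scale. When the categories have no inherent hierarchy or order, the data is called nominal; there is no mathematical connection between the categories. A person's gender (male/female), eye color (blue, green, brown, etc.), type of vehicle they drive (sedan, SUV, truck, etc.), or the kind of fruit they consume (apple, banana, orange, etc.) are examples of nominal categorical data. When the categories can be ranked (for example, the clarity grade of a diamond), the data is called ordinal.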

- In simple terms, categorical data is information that can be put into categories.
- Since the majority of machine learning algorithms are created to operate with numerical data, categorical data is handled differently from numerical data in this field. 
- Before categorical data can be utilized as input to a machine learning model, it must first be transformed into numerical data. 
- This process of converting categorical data into numeric representation is known as encoding.

---

### Univariate: Weight of a group (toy example)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
x = np.array([148, 154, 158, 160, 161, 162, 166, 170, 182, 195, 236])

# Plotting distribution plot (histogram + kernel density estimation)
sns.histplot(x, kde=True, color='salmon')
# Adjust layout
plt.tight_layout()
# Show the plot
plt.show()


- Kernel Density Estimation (KDE) is a non-parametric method used for estimating the probability density function (PDF) of a continuous random variable. It's a technique to visualize the distribution of data in a smoothed continuous form.
- In simple terms, KDE provides a smooth curve that approximates the shape of the underlying probability distribution of a dataset. It estimates the density function by placing a kernel (a smooth, symmetric function such as a Gaussian or Epanechnikov kernel) at each data point and summing up these kernels to create a smooth curve that represents the overall distribution of the data.
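
The "one kernel per data point" idea can be checked numerically. The following is a minimal sketch, assuming a Gaussian kernel and the default bandwidth that `scipy.stats.gaussian_kde` selects (Scott's rule); the grid limits are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

# Same toy sample as in the histogram cell above
x = np.array([148, 154, 158, 160, 161, 162, 166, 170, 182, 195, 236], dtype=float)

# Reference estimate from SciPy (bandwidth chosen by Scott's rule by default)
kde = gaussian_kde(x)
h = np.sqrt(kde.covariance[0, 0])  # standard deviation of each Gaussian kernel

# Manual estimate: place one Gaussian bump on every data point and average them
grid = np.linspace(x.min() - 50, x.max() + 50, 400)
manual_density = norm.pdf(grid[:, None], loc=x[None, :], scale=h).mean(axis=1)

# The two estimates should agree up to floating-point error
print(np.allclose(manual_density, kde(grid)))
```

A larger bandwidth `h` gives a smoother curve, while a smaller one follows the individual data points more closely.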

In [None]:
# Using Seaborn to create the boxplot
sns.boxplot(data=x)  # You can specify your own color palette

# Setting labels and title
plt.xlabel('Categories')
plt.ylabel('Values')
plt.title('Boxplot Example for toy data')

# Show the plot
plt.show()

#### Multivariate: Fictional blood-pressure data

Multivariate analysis refers to the analysis of datasets involving more than one variable. It aims to understand the relationships between multiple variables simultaneously and uncover patterns, dependencies, and interactions among them.

- There are various techniques and methods for multivariate analysis, each serving different purposes based on the nature of the data and the objective of the analysis. 

- When performing multivariate analysis, it's crucial to understand the assumptions of each technique and interpret the results accordingly. Additionally, visualization techniques like scatter plots, heatmap, and pair plots can aid in understanding relationships between multiple variables.

- The choice of technique depends on the research question, the nature of the data, and the specific objectives of the analysis. Depending on your dataset and research objectives, you can select an appropriate multivariate analysis technique to derive insights and patterns from your data.



### Example: blood pressure of patients before and after treatment (taken from the Stata 11 manual datasets: https://www.stata-press.com/data/r11/)


In [None]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")
df.info()

In [None]:
df.columns

In [None]:
df.tail()

In [None]:
df.describe()

<div class="alert alert-block alert-warning">
<b>Types of feature?</b>  Classify each feature as continuous or categorical.
</div>

In [None]:
df['sex'].unique()

In [None]:
age_group = df['agegrp'].unique()
print(age_group)

In [None]:
df['bp_before'].unique()

- Blood pressure is an example of continuous data. Blood pressure can be measured to as many decimals as the measuring instrument allows. For example, although a typical blood pressure cuff does not provide decimal places, a digital blood pressure monitor (often used in hospital settings) may have the capacity to determine the blood pressure to 3 decimal places, and even more powerful blood pressure monitors may be developed that can read a patient's blood pressure to 5 decimal places.

<div class="alert alert-block alert-info">
<b>Sampling:</b> Randomly choosing some samples from a population.
</div>

In [None]:
df.sample(10)

### Exploratory Data Analysis: Visualization


In [None]:
# Setting the plot style (e.g., 'ggplot', 'bmh', 'fivethirtyeight', etc.)
# Run plt.style.available to list the style names installed with your Matplotlib version.
'''
'ggplot' - Style similar to ggplot in R
'fivethirtyeight' - Style similar to plots on FiveThirtyEight
'classic' - Classic Matplotlib style
'bmh' - Style from the Bayesian Methods for Hackers book
'dark_background' - Dark background style
'tableau-colorblind10' - Tableau colorblind 10 palette
'Solarize_Light2' - Light background with strong colors
'seaborn-v0_8' - Seaborn-like style (named 'seaborn' before Matplotlib 3.6)
'seaborn-v0_8-dark-palette' - Seaborn dark palette
'seaborn-v0_8-whitegrid' - Seaborn style with white grid lines
'''
plt.style.use('ggplot')

df_gender = df.groupby('sex')
df_gender.boxplot()
df_gender.head()

<div class="alert alert-block alert-danger">
<b>Exercise:</b> Try plotting boxplot grouped by age group.
</div>

In [None]:
# customizing dataframe for further analysis.
df_male_after = df['bp_after'][df['sex'] == 'Male']
df_female_after = df['bp_after'][df['sex'] == 'Female']

In [None]:
after_treatment_gender_data = {'male': np.array(df_male_after), 
                          'female': np.array(df_female_after)}
df_after_treatment_gender =pd.DataFrame(after_treatment_gender_data)
df_after_treatment_gender.describe()

In [None]:
sns.boxplot(data=df_after_treatment_gender)

plt.title('Boxplot of blood pressure after treatment based on gender')
plt.xlabel('gender')
plt.ylabel('blood pressure')
plt.show()

Seaborn offers a variety of color palettes that you can use for your plots. Here is a list of some of the named palettes available in Seaborn:

- Sequential Color Palettes:
'rocket'
'mako'
'flare'
'crest'
'cividis'
'viridis'
'plasma'
'inferno'
'magma'

- Diverging Color Palettes:
'coolwarm'
'RdBu'
'PuOr'
'BrBG'
'PiYG'
'PRGn'

- Qualitative Color Palettes:
'pastel'
'bright'
'dark'
'deep'
'colorblind'
'Set1', 'Set2', 'Set3'
'tab10', 'tab20', 'tab20b', 'tab20c'

- Other Palettes:
'husl'
'hls'
'twilight'
'twilight_shifted'

In [None]:
# Plotting distribution plot (histogram + kernel density estimation) for males
# `palette` has no effect on a single series, so pick one color from the palette instead
sns.histplot(df_male_after, kde=True, color=sns.color_palette('mako')[2])

In [None]:
# Plotting distribution plot (histogram + kernel density estimation) for females
sns.histplot(df_female_after, kde=True, color=sns.color_palette('bright')[1])

In [None]:
# Plotting both distributions together; each DataFrame column gets its own color from the palette
sns.histplot(df_after_treatment_gender, kde=True, palette='bright')

 <div class="alert alert-block alert-danger">
<b>What type of distribution does this feature possess?</b> Try plotting probability plot.
</div>

A probability plot, also known as a Q-Q (quantile-quantile) plot, is a graphical method used to assess whether a dataset follows a particular theoretical distribution, such as the normal distribution.

The main purpose of a probability plot is to visually compare the quantiles of a dataset against the quantiles of a theoretical distribution. If the data follows the theoretical distribution closely, the points on the plot will fall approximately along a straight line, indicating that the data fits that distribution well.

For instance, when assessing normality using a Q-Q plot:

- If the data points fall close to a straight line, the data is consistent with a normal distribution.
- If the points deviate from a straight line, it suggests a departure from normality.

>`probplot` optionally calculates a best-fit line for the data and plots the
results using Matplotlib or a given plot function.
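
The numbers behind a probability plot can also be inspected without drawing anything. The sketch below is an illustration on synthetic data (the seed, sample sizes and distribution parameters are assumptions, not values from the blood-pressure dataset): when `plot` is omitted, `probplot` returns the theoretical quantiles, the ordered data, and the slope, intercept and correlation `r` of the best-fit line.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
normal_sample = rng.normal(loc=150, scale=12, size=200)
skewed_sample = rng.exponential(scale=12, size=200)

# Returns ((theoretical quantiles, ordered data), (slope, intercept, r))
(_, _), (slope_n, intercept_n, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

# r close to 1 means the points lie close to the fitted straight line
print(f"r for the normal sample: {r_normal:.3f}")
print(f"r for the skewed sample: {r_skewed:.3f}")
```

For normally distributed data the slope and intercept of the fitted line roughly estimate the standard deviation and the mean of the sample.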

In [None]:

import scipy.stats as stats
import matplotlib.pyplot as plt

# Using a custom color palette
custom_palette = sns.color_palette(['#FF5733', '#33FF57', '#3357FF'])  # List of RGB/hex colors: https://www.color-hex.com/
sns.set_palette(custom_palette)

sampling_male= np.array(df['bp_after'][df['sex'] == 'Male'])

normality_plot, stat = stats.probplot(sampling_male, plot= plt, rvalue= True,)
plt.show()

 <div class="alert alert-block alert-danger">
<b>What type of distribution does female samples possess?</b> 
</div>

 <div class="alert alert-block alert-danger">
<b>What type of distribution does the sampling difference between male and female possess?</b> Plot the probability plot and its distribution.
</div>

In [None]:
import scipy.stats as stats
import matplotlib.pyplot as plt

# Using a custom color palette
custom_palette = sns.color_palette(['#FF5733', '#33FF57', '#3357FF'])  # List of RGB/hex colors: https://www.color-hex.com/
sns.set_palette(custom_palette)

sampling_difference = np.array(df['bp_after'][df['sex'] == 'Male'])- np.array(df['bp_after'][df['sex'] == 'Female'])

normality_plot, stat = stats.probplot(sampling_difference, plot= plt, rvalue= True,)
plt.show()

In [None]:
# Plotting distribution plot (histogram + kernel density estimation) of the differences
sns.histplot(sampling_difference, kde=True, color=custom_palette[0])

> This suggests the differences are approximately normally distributed, although the points at the two ends do not fall exactly on the red line.

Another way to check for normality is the Shapiro-Wilk test.
- The test statistic (W) tends to be high (close to 1) for samples drawn from a normal distribution.
- The null hypothesis of the Shapiro-Wilk test is that the data are normally distributed. 
- If the p-value resulting from the test is less than a chosen significance level (commonly 0.05), we reject the null hypothesis, concluding that the data do not follow a normal distribution.
- A “small” p-value means that data like ours would rarely be observed if the population were normally distributed; otherwise, we fail to reject the null hypothesis.
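
The decision rule above can be sketched as follows. This is only an illustration on synthetic samples (the seed, sample size of 300 and the significance level 0.05 are assumptions); it uses `scipy.stats.shapiro`, which returns the W statistic and the p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility
samples = {
    "normal": rng.normal(loc=0.0, scale=1.0, size=300),
    "exponential": rng.exponential(scale=1.0, size=300),
}

alpha = 0.05  # chosen significance level
results = {}
for name, values in samples.items():
    w_stat, p_value = stats.shapiro(values)
    # Reject normality only when the p-value falls below alpha
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    results[name] = (w_stat, p_value, decision)
    print(f"{name:12s} W={w_stat:.3f}  p={p_value:.4f}  -> {decision}")
```

Failing to reject the null hypothesis does not prove that the data are normal; it only means the test found no strong evidence against normality.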

In [None]:
stats.shapiro(sampling_difference)

Both the statistic (about 0.98) and the p-value (about 0.71) are high. Since the p-value is well above 0.05, we fail to reject the null hypothesis: the differences are consistent with a normal distribution.

---
## More on Categorical Dataset
- https://www.datacamp.com/tutorial/categorical-data


 - This classic dataset contains the prices and other attributes of almost 54,000 diamonds. It's a great dataset for beginners learning to work with data analysis and visualization.
 https://www.kaggle.com/datasets/shivam2503/diamonds

 Below is a sample of the dataset and its features:

- price: price in US dollars (find the range)

- carat: weight of the diamond (find the range)

- cut: quality of the cut (find the categories)

- color: diamond colour, from J (worst) to D (best)

- clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

- Symmetry (find the categories)

- Report (find the categories)

- Polish (find the categories)


In [None]:
# read csv using pandas
data = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/diamond.csv')

# check the data types
data.info()


<div class="alert alert-block alert-warning">
<b>Types of feature?</b>  Classify each feature as continuous or categorical.
</div>

- Well, all the columns in this example are categorical except for `Carat Weight` and `Price`. Let’s see if we are right about this by checking the default data types.

In [None]:

# check the head of dataframe
data.head()

`value_counts()` is a function in the pandas library that returns the frequency of each unique value in a categorical data column. This function is useful when you want to get a quick understanding of the distribution of a categorical variable, such as the most common categories and their frequency.

In [None]:
# read csv using pandas
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/diamond.csv')

# check value counts of Cut column
bar_data = data['Cut'].value_counts()
bar_data.head()

In [None]:

cut_counts = data['Cut'].value_counts()
df_type_cut = pd.DataFrame({'cut type': list(cut_counts.index), 'count': list(cut_counts.values)})
df_type_cut.plot.bar(x='cut type', y='count')


`groupby()` is a function in Pandas that allows you to group data by one or more columns and apply aggregate functions such as sum, mean, and count. This function is useful when you want to perform more complex analysis on categorical data, such as computing the average of a numeric variable for each category. Let’s see an example:

In [None]:
# Apply the groupby() function to the 'Color' column
d_color = data.groupby('Color')
# Let's print the first entries in all the groups formed.

d_color.first()
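
As mentioned above, `groupby()` is most useful when combined with an aggregate function. The sketch below computes the average of a numeric variable for each category on a small made-up table (the values are invented for illustration and are not taken from the diamond dataset).

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column
toy = pd.DataFrame({
    "Color": ["D", "E", "D", "F", "E", "D"],
    "Price": [5000, 4200, 5600, 3900, 4400, 5200],
})

# Average price for each color category
mean_price = toy.groupby("Color")["Price"].mean()
print(mean_price)

# Several aggregates at once
summary = toy.groupby("Color")["Price"].agg(["count", "mean", "max"])
print(summary)
```

Assuming the column names shown by `data.info()` above, the same pattern on the diamond data would be `data.groupby('Color')['Price'].mean()`.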

<div class="alert alert-block alert-warning">
<b>Find the details of a particular group:</b>`get_group()`
</div>

In [None]:
d_color.get_group('D')

<div class="alert alert-block alert-warning">
<b>Let's convert a categorical feature into numerical representation</b>: We can then carry out more computation & calculation for modelling and machine learning tasks.
</div>

In [None]:
data['Cut']

One-hot encoding is a process of representing categorical data as a set of binary values, where each category is mapped to its own binary column. 
- In this representation, only one bit is set to 1, and the rest are set to 0, hence the name "one hot." 
- This is commonly used in machine learning to convert categorical data into a format that algorithms can process.
- The `pd.get_dummies()` function in pandas performs one-hot encoding by converting the categorical variable ('Cut') into multiple binary columns representing each category. The new columns hold binary values (0/1, or `False`/`True` in recent pandas versions) indicating the presence of each category in the original data.
- For example, we can perform one-hot encoding on categorical features using scikit-learn, then train a basic machine learning model (Logistic Regression) using the encoded features.
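
To see what "only one bit is set to 1" means, the following sketch builds a one-hot encoding by hand and compares it with `pd.get_dummies()`. The five cut labels are a made-up toy series, and the identity-matrix lookup is just one possible way to write the manual version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with three categories
cuts = pd.Series(["Ideal", "Good", "Ideal", "Fair", "Good"], name="Cut")

# pandas route: one column per category (dtype=int requests 0/1 instead of booleans)
dummies = pd.get_dummies(cuts, dtype=int)

# Manual route: map each category to an integer code,
# then pick the matching row of an identity matrix
categories = sorted(cuts.unique())
codes = cuts.map({c: i for i, c in enumerate(categories)}).to_numpy()
manual = np.eye(len(categories), dtype=int)[codes]

print(dummies)
print(manual)
```

Each row of `manual` contains exactly one 1, located in the column of the row's category.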

In [None]:
# apply get_dummies function
df_encoded = pd.get_dummies(data["Cut"])
df_encoded.tail()

In [None]:
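# Converting one-hot encoded DataFrame to 0-1 array
# astype(int) guarantees 0/1 even when get_dummies returns True/False (pandas 2.0 and later)
array_representation = df_encoded.astype(int).values
print(array_representation)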

We will learn more on Analysis of Categorical Data in the last week of this course
- https://ethanweed.github.io/pythonbook/05.01-chisquare.html 

---

<div class="alert alert-block alert-warning">
<b>Carry out similar tasks for iris dataset. Continue the code below.</b>
</div>

- Boxplot
- Scatterplot
- Distribution

In [None]:
import pandas as pd
#df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")
df = pd.read_csv("https://raw.githubusercontent.com/Opensourcefordatascience/Data-sets/master/Iris_Data.csv")
df.info()

In [None]:
df.head()

---

<div class="alert alert-block alert-danger">
<b>The rest of sections  will be updated by 29/12/2023...</b>
</div>

# t-Tests
- https://www.pythonfordatascience.org/parametric-assumptions-python/

#### ASSUMPTION CHECK
The assumptions in this section need to be met in order for the test results to be considered valid. A more in-depth look at parametric assumptions, including some potential remedies, is provided in the link above.

> THE TWO SAMPLES ARE INDEPENDENT:
This assumption is tested when the study is designed. What this means is that no individual has data in group A and B; mutually exclusive.

> POPULATION DISTRIBUTIONS ARE NORMAL:
One of the assumptions is that the sampling distribution is normal. Here, the normality check is applied to the difference in values between the groups. 

- We can use the probability plot available in `scipy.stats`.
- We can also use the Shapiro-Wilk test, via the `shapiro()` function from `scipy.stats`.



In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
x = np.array([148, 154, 158, 160, 161, 162, 166, 170, 182, 195, 236])

# Example of using a Seaborn default color palette; {deep, muted, bright, pastel, dark, colorblind}
sns.set_palette('deep')
# Plotting distribution plot (histogram + kernel density estimation)
sns.histplot(x, kde=True)

In [None]:
normality_plot, stat = stats.probplot(x, plot= plt, rvalue= True)

In [None]:
from scipy import stats
stats.shapiro(x)

In [None]:
import matplotlib.pyplot as plt
import numpy as np
sns.set_palette('bright')
# Generate some random data
data = np.random.normal(size=1000)
# Plotting distribution plot (histogram + kernel density estimation)
sns.histplot(data, kde=True)
plt.show()

In [None]:
normality_plot, stat = stats.probplot(data, plot= plt, rvalue= True)

In [None]:
from scipy import stats
stats.shapiro(data)

In [None]:
import statsmodels.api as sm
import matplotlib.pyplot as plt
import numpy as np

# Generate some random data
data = np.random.exponential(3.45, 10000)

# Plotting distribution plot (histogram + kernel density estimation)
sns.histplot(data, kde=True)
plt.show()

In [None]:
normality_plot, stat = stats.probplot(data, plot= plt, rvalue= True)

In [None]:
from scipy import stats
stats.shapiro(data)

---

### Independent Samples t-test



`scipy.stats.ttest_ind()` conducts the independent samples t-test and returns the t statistic and its associated p-value. For more information about this method, please refer to the official [documentation page](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html).


- Calculate the t-test for the means of two independent samples of scores.
- This is a test for the null hypothesis that 2 independent samples have identical average (expected) values. 
- This test assumes that the populations have identical variances by default.

- H0: the two populations have identical means.
- H1: the two populations have different means.
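
The equal-variance default can be made explicit with a short sketch. The two groups below are synthetic (the seed, means, standard deviations and sample sizes are assumptions for illustration only). The t statistic is computed once with `ttest_ind`, once by hand from the pooled variance, and once with `equal_var=False`, which switches SciPy to Welch's t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # arbitrary seed
group_a = rng.normal(loc=155, scale=14, size=60)
group_b = rng.normal(loc=147, scale=12, size=45)

# Student's t-test (SciPy default: equal_var=True)
t_student, p_student = stats.ttest_ind(group_a, group_b)

# Same statistic by hand, using the pooled sample variance
n_a, n_b = len(group_a), len(group_b)
pooled_var = ((n_a - 1) * group_a.var(ddof=1)
              + (n_b - 1) * group_b.var(ddof=1)) / (n_a + n_b - 2)
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Welch's t-test does not assume equal population variances
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t={t_student:.4f}, p={p_student:.4f} (manual t={t_manual:.4f})")
print(f"Welch:   t={t_welch:.4f}, p={p_welch:.4f}")
```

When the group variances look clearly different, Welch's version is usually the safer choice.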

In [None]:
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/researchpy/Data-sets/master/blood_pressure.csv")


import scipy.stats as stats
stats.ttest_ind(df['bp_after'][df['sex'] == 'Male'],
                df['bp_after'][df['sex'] == 'Female'])

>Interpretation:
1. p= 0.001 (which is <0.05), so we reject the null hypothesis (H0) in favour of the alternative hypothesis (H1).
2. The average blood pressure after the treatment for males, mean= 155.2, was statistically significantly higher than for females, mean= 147.2 (144.2, 150.2); t(118)= 3.3480, p= 0.001.

There is a statistically significant difference in the average post-treatment blood pressure between males and females, t= 3.3480, p= 0.001.



---

### Paired Samples t-test

# Correlation

Correlation is a statistical measure that describes the strength and direction of a relationship between two numerical variables. It helps in understanding how changes in one variable are associated with changes in another variable.

- Types of Correlation:

1. Pearson Correlation Coefficient (Pearson's r):
    - Measures linear correlation between two continuous variables. Ranges from -1 to +1.
    - +1 indicates a perfect positive linear relationship.
    - -1 indicates a perfect negative linear relationship.
    - 0 indicates no linear relationship.
    - Assumes a linear relationship and normality of data.
2. Spearman's Rank Correlation (Spearman's rho):
    - Measures monotonic relationship between two variables.
    - Based on the ranks of the data (ordinal relationship).
    - Also ranges from -1 to +1.
    - Robust to outliers and non-linear relationships.

3. Kendall's Tau:
    - Measures ordinal association between two variables.
    - Similar to Spearman's correlation but focuses on concordant and discordant pairs of ranks.


- Visualizing correlations can provide a clear understanding of the relationships between variables in a dataset. Heatmaps are commonly used to visualize correlation matrices, especially when dealing with multiple variables.



### Toy example

In [None]:
import pandas as pd

# Sample dataset with numerical columns 'X' and 'Y'
data = {'X': [1, 2, 3, 4, 5], 'Y': [2, 4, 6, 8, 10]}
df = pd.DataFrame(data)

# Calculate Pearson correlation coefficient
pearson_corr = df['X'].corr(df['Y'])  # Pandas corr() function
print(f"Pearson's correlation coefficient: {pearson_corr:.2f}")


This code snippet demonstrates using Pandas' corr() function to calculate the Pearson correlation coefficient between two columns ('X' and 'Y') in a DataFrame.

For Spearman and Kendall correlations, you can use `df.corr(method='spearman')` or `df.corr(method='kendall')` respectively, specifying the method parameter in the corr() function.
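
The difference between the three coefficients shows up on data that is monotonic but not linear. The sketch below uses a made-up toy relationship (Y equal to the cube of X, an assumption for illustration) and the `method` parameter of `Series.corr()`.

```python
import pandas as pd

# Toy data: Y always increases with X, but not along a straight line
df_toy = pd.DataFrame({"X": [1, 2, 3, 4, 5, 6]})
df_toy["Y"] = df_toy["X"] ** 3

pearson = df_toy["X"].corr(df_toy["Y"])                      # linear association
spearman = df_toy["X"].corr(df_toy["Y"], method="spearman")  # rank-based
kendall = df_toy["X"].corr(df_toy["Y"], method="kendall")    # concordant vs discordant pairs

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
print(f"Kendall:  {kendall:.3f}")
```

Because the ranks of X and Y agree perfectly, the rank-based coefficients reach 1, while Pearson's r stays below 1 since the relationship is not linear.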

Understanding correlations is essential for identifying relationships between variables in your data, helping in feature selection, and guiding further analysis or modeling decisions.

<div class="alert alert-block alert-warning">
<b>Let's visualize the correlation between features:</b> Use heatmap!
</div>

>In Pandas, the `corr()` method calculates the correlation matrix using the Pearson correlation coefficient by default. 

> The correlation matrix and the subsequent heatmap visualization consider only numeric variables, which can be used to explore relationships among continuous measurements in the Iris dataset or any other dataset with mixed data types. Adjust this code as needed for your specific dataset and analysis.

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

#Loading the Iris dataset from Seaborn
iris = sns.load_dataset('iris')

# Selecting only numeric columns for correlation calculation
numeric_columns = iris.select_dtypes(include='number')

# Calculating the correlation matrix
corr_matrix = numeric_columns.corr()
print(corr_matrix)

# Plotting a heatmap of the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f', annot_kws={"size": 10})
plt.title('Correlation Heatmap of Iris Dataset ')
plt.show()