
# MSS482 - GRAPHING TECHNOLOGY IN MATHEMATICS AND SCIENCE

**SEMESTER 1 2023/2024**


>R.U.Gobithaasan (2023). School of Mathematical Sciences, Universiti Sains Malaysia.
[Official Website](https://math.usm.my/academic-profile/705-gobithaasan-rudrusamy) 


<p align="center">
     © 2023 R.U. Gobithaasan All Rights Reserved.
</p>

# Hands-on: Graphing Data with Python

2.1 Dataset <br>
    a) Data types (see powerpoint) <br>
    b) Dataset: sources  <br>
  
2.2 Visualizing a Dataset and Investigating its Descriptive Statistics <br>
    a) Histogram <br>
    b) Frequency Distribution <br>
    c) Boxplot <br>
    d) Stem-and-leaf plot <br>




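## Preliminaries:

> Install the following packages, for example: `!python -m pip install pandas`
1. pandas
2. scikit-learn (imported as `sklearn`)
3. numpy
4. matplotlib
5. seaborn
6. plotly
7. scipy
8. stemgraphic (used for the stem-and-leaf plot)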


## Dataset: Online Dataset sources

**Online Sources:** 
- Google Dataset Search: https://datasetsearch.research.google.com/ 
- Kaggle: https://www.kaggle.com/datasets 
- UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php 
- Earth Data: https://www.earthdata.nasa.gov/
- Scikit Dataset: https://scikit-learn.org/stable/datasets.html 


## Simple Stats of Given Data

### Histograms for various types of distributions
- https://www.geeksforgeeks.org/interpretations-of-histogram/?ref=lbp

In [None]:
# Imports 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Normal histogram plot
data = np.random.normal(10.0, 3, 500)
sns.displot(data, kde= True, bins=10, color='black')

# Right-skewed histogram (long tail towards larger values)
wc_goals = [0]*19 + [1]*49 + [2]*60 + [3]*47 + [4]*32 + [5]*18 + [6]*3 + [7]*3 + [8]
sns.displot(wc_goals, bins=8, kde=True, alpha=0.6, color='blue')

# Left-skewed histogram (mirror image: long tail towards smaller values)
wc_goals_conc = [0]*19 + [-1]*49 + [-2]*60 + [-3]*47 + [-4]*32 + [-5]*18 + [-6]*3 + [-7]*3 + [-8]
sns.displot(wc_goals_conc, kde=True, bins=8, alpha=0.6, color='red')

# Bi-modal histogram
N=400
mu_1, sigma_1 = 80, 10
mu_2, sigma_2 = 20, 10
# Generate two normal distributions of given mean sd and concatenate
X_1 = np.random.normal(mu_1, sigma_1, N)
X_2 = np.random.normal(mu_2, sigma_2, N)
X = np.concatenate([X_1, X_2])
sns.displot(X,bins=10,kde=True , color='green')

# Uniform histogram (an example of die roll with N=600)
die_roll = [1]*89 + [2]*94 + [3]*110 + [4]*101 + [5]*90 +[6]*116
sns.displot(die_roll, kde=True, bins =6)

# Normal distribution with outliers (30 values placed at 200)
X_1 = np.random.normal(mu_1, sigma_1, N)
X_1 =np.concatenate([X_1, [200]*30])
sns.displot(X_1, kde= True, bins=13)


In [None]:
import matplotlib.pyplot as plt 

data = [16, 25, 47, 47, 56, 23, 45, 19, 55, 55, 55, 44, 27, 90] 

# Plot a histogram
plt.hist(data, bins=10)

# Add x-axis and y-axis titles
plt.xlabel('Data Values')
plt.ylabel('Frequency')

# Display the plot
plt.show()
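
The histogram above is a picture of a frequency distribution. As a minimal sketch, the same counts can be tabulated with `np.histogram`, which returns the number of values in each bin together with the bin edges; the print formatting is just one possible layout.

```python
import numpy as np

data = [16, 25, 47, 47, 56, 23, 45, 19, 55, 55, 55, 44, 27, 90]

# np.histogram returns the counts per bin and the bin edges (10 bins -> 11 edges)
counts, edges = np.histogram(data, bins=10)

# Print one row per class interval
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f'{left:6.1f} to {right:6.1f}: {count}')
```

The counts sum to the number of observations, and empty rows correspond to the gaps visible in the histogram.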


In [None]:
# you may convert from one type of Python container to another easily and access its built-in functions
import numpy as np 
data_np_array = np.array(data)

In [None]:
# two ways to access built-in functions
print(data_np_array.mean())
print(np.mean(data_np_array))

### Descriptive Statistics with NumPy
-  https://numpy.org/doc/stable/reference/routines.statistics.html


In [None]:
import numpy as np 
data_np_array = np.array(data)
print(data_np_array)
print('mean:', data_np_array.mean())
print('median:',np.median(data_np_array))
print('min, max:', data_np_array.min(),data_np_array.max())
print('std, var:', data_np_array.std(),data_np_array.var())
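
One detail worth knowing when comparing libraries: NumPy's `std()` and `var()` divide by `n` by default (`ddof=0`, the population formula), whereas pandas' `std()` and `describe()` divide by `n - 1` (`ddof=1`, the sample formula). The sketch below shows both values for the same data using NumPy only.

```python
import numpy as np

data = np.array([16, 25, 47, 47, 56, 23, 45, 19, 55, 55, 55, 44, 27, 90])
n = data.size

pop_std = data.std()            # ddof=0: divide by n (NumPy default)
sample_std = data.std(ddof=1)   # ddof=1: divide by n - 1 (pandas default)

print('population std:', pop_std)
print('sample std:', sample_std)
```

The two values differ by a factor of `sqrt(n / (n - 1))`, which matters for a small sample like this one and becomes negligible for large `n`.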

### Descriptive Statistics with SciPy

In [None]:
from scipy.stats import mode, describe
print(mode(data_np_array))
describe(data_np_array)

In [None]:
np.percentile(data_np_array, 25)

In [None]:
print('0.0 quantile:', np.quantile(data_np_array,0.0))
print('0.25 quantile:', np.quantile(data_np_array,0.25))
print('0.5 quantile:', np.quantile(data_np_array,0.5))
print('0.75 quantile:', np.quantile(data_np_array,0.75))
print('1.0 quantile:', np.quantile(data_np_array,1.0))

### Descriptive Statistics with Pandas
A [DataFrame](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html#) is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. 
>It is similar to a spreadsheet, a SQL table or the data.frame in R.

- Each column in a DataFrame is a **Series**. 
- If you are familiar with Python dictionaries, selecting a single column is very similar to selecting a dictionary value by its key.
- We use `info()` to inspect a summary of the DataFrame (column names, non-null counts and data types).
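
As a small illustration of the points above, the sketch below builds a one-column DataFrame and selects that column by key. The column name `'value'` is chosen for this sketch only; passing a plain list to `pd.DataFrame` without a name gives a column labelled `0`.

```python
import pandas as pd

data = [16, 25, 47, 47, 56, 23, 45, 19, 55, 55, 55, 44, 27, 90]
df = pd.DataFrame({'value': data})   # 'value' is a column name chosen for this sketch

column = df['value']                 # select a column by key, like a dictionary lookup
print(type(column))                  # each column is a pandas Series
print(column.head(3))
```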

In [None]:
# you may convert from one type of Python container to another easily and access its built-in functions
import pandas as pd 
data_dataframe = pd.DataFrame(data)
data_dataframe.info()

In [None]:
data_dataframe.head()

In [None]:
data_dataframe.describe()

### Boxplot
- https://www.geeksforgeeks.org/box-plot/

In [None]:
import matplotlib.pyplot as plt 

plt.boxplot(data_np_array,showmeans=True)

# Add labels to the plot
plt.xlabel('Data')
plt.ylabel('Values')
plt.title('Box plot of the data')

plt.show()
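
To connect the picture to numbers, the sketch below computes the quantities a box plot is built from: the quartiles, the interquartile range (IQR), and the 1.5 × IQR fences that matplotlib uses by default (`whis=1.5`) to decide which points are drawn as outliers. The variable names are illustrative.

```python
import numpy as np

data = np.array([16, 25, 47, 47, 56, 23, 45, 19, 55, 55, 55, 44, 27, 90])

# Box edges and centre line: first quartile, median, third quartile
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print('Q1, median, Q3:', q1, median, q3)
print('IQR:', iqr)
print('fences:', lower_fence, upper_fence)
print('outliers:', outliers)
```

For this sample no value falls outside the fences, so even the largest value, 90, is reached by the upper whisker rather than being marked as an outlier.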

In [None]:
sns.set_style("whitegrid") 
sns.boxplot(data_np_array) 

In [None]:
data_dataframe.boxplot()

### Stem-and-leaf plot

In [None]:
sorted_data = np.sort(data_np_array)

In [None]:
sorted_data

In [None]:
# the stem of each value is its tens digit, e.g. 47 -> 4
stems = sorted_data // 10

plt.ylabel('Data') # for label at y-axis 
plt.xlabel('stems') # for label at x-axis 
plt.xlim(0, 10) # limit of the values at x axis 

plt.stem(stems, sorted_data) # required plot 
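
Note that `plt.stem` draws a stem (lollipop) chart. A classic stem-and-leaf display is textual: each row shows a stem followed by its leaves. A minimal pure-Python sketch for two-digit data follows; `stem_and_leaf` is a hypothetical helper name, not a library function.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group values by tens digit (stem) and collect the units digits (leaves)."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # 47 -> stem 4, leaf 7
        groups[stem].append(leaf)
    return dict(groups)

data = [16, 25, 47, 47, 56, 23, 45, 19, 55, 55, 55, 44, 27, 90]

# One row per stem, leaves in ascending order
for stem, leaves in stem_and_leaf(data).items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Unlike a histogram, this display keeps every original value readable while still showing the shape of the distribution.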


In [None]:
#python -m pip install stemgraphic
import stemgraphic

# data and scale 
stemgraphic.stem_graphic(data_np_array, scale = 10) 

### Simulation of Random Data: Normal

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Generate some random data normally distributed
data_normal = np.random.normal(size=1010)
print(data_normal.size)
# Plot a histogram
plt.hist(data_normal, bins=30)

# Add x-axis and y-axis titles
plt.xlabel('Data Values')
plt.ylabel('Frequency')

# Display the plot
plt.show()

In [None]:
import seaborn as sns

# Plot a density plot
sns.kdeplot(data_normal)

# Add labels to the plot
plt.xlabel('Data Values')
plt.ylabel('Density')

plt.show()

In [None]:
sns.catplot(data=data_normal, kind="swarm")

In [None]:
data_normal

In [None]:
from scipy.stats import mode, describe
describe(data_normal)

In [None]:
# Add labels to the plot
plt.boxplot(data_normal,showmeans=True)
plt.xlabel('Data')
plt.ylabel('Values')
plt.title('Box plot of the data')

plt.show()

### Simulation of Random Data: not normal (right skewed)

In [None]:
np.random.exponential?

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Generate some random data: exponential(scale=1.0, size=None)
data_exp = np.random.exponential(2.5, 1010)
print(data_exp.size)
# Plot a histogram

plt.hist(data_exp, bins=30, alpha=0.7, color='green', edgecolor='black')
plt.title('Right-Skewed Distribution')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

In [None]:
# The result has positive skewness, i.e. the distribution is skewed to the right
import seaborn as sns

# Plot a density plot
sns.kdeplot(data_exp)


# Add labels to the plot
plt.xlabel('Data Values')
plt.ylabel('Density')

plt.show()

In [None]:
from scipy.stats import mode, describe
describe(data_exp)

- The density curve has a longer tail to the right.
- This is a distribution that is skewed to the right.
- It is also said to be positively skewed since its coefficient of skewness is positive.

In [None]:
# Add labels to the plot
plt.boxplot(data_exp,showmeans=True)
plt.xlabel('Data')
plt.ylabel('Values')
plt.title('Box plot of the data')

plt.show()

### Simulation of Random Data: not normal (left skewed)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Generating left-skewed data using the Weibull distribution
data_size = 1010
shape_parameter = 20.0  # Adjust this parameter to control skewness

# Generate data following a Weibull distribution
data_weibull = 5*np.random.weibull(shape_parameter, size=data_size)

# Plotting histogram to visualize the distribution
plt.hist(data_weibull, bins=30, alpha=0.7, color='orange', edgecolor='black')
plt.title('Left-Skewed Weibull Distribution')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()


In [None]:
# The result has negative skewness, i.e. the distribution is skewed to the left
import seaborn as sns

# Plot a density plot
sns.kdeplot(data_weibull)


# Add labels to the plot
plt.xlabel('Data Values')
plt.ylabel('Density')

plt.show()

from scipy.stats import mode, describe
describe(data_weibull)

- The density curve has a longer tail to the left.
- This is a distribution that is skewed to the left.
- It is also said to be negatively skewed since its coefficient of skewness is negative.
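
A common rule of thumb links the direction of skew to the order of mean and median: a long right tail pulls the mean above the median, and a long left tail pulls it below. It holds for many unimodal distributions, though not for all. The sketch below checks it for the two simulated distributions; the fixed seed and the larger sample size are assumptions made here to keep the result stable, and differ from the 1010 samples drawn above.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (assumption) so the run is repeatable
n = 100_000                      # larger than 1010 to reduce sampling noise

samples = {
    'exponential (skewed right)': rng.exponential(2.5, n),
    'Weibull, shape 20 (skewed left)': 5 * rng.weibull(20.0, n),
}

# Compare mean and median for each simulated distribution
for name, x in samples.items():
    print(f'{name}: mean={x.mean():.3f}, median={np.median(x):.3f}')
```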

In [None]:
# Add labels to the plot
plt.boxplot(data_weibull,showmeans=True)
plt.xlabel('Data')
plt.ylabel('Values')
plt.title('Box plot of the data')

plt.show()

In [None]:
type(data_weibull)

### Comparing datasets

In [None]:
import pandas as pd
dataset = {'Normal': data_normal.T,
           'Skew right': data_exp.T,
           'Skew left': data_weibull.T}
df = pd.DataFrame(dataset)

df.head()

In [None]:
df.boxplot()

## Data Acquisition from scikit-learn: Iris Dataset

<b>Let's load the Iris dataset from scikit-learn and do the following:</b>

1. What is the type of the data? 
>TIPS: They extend dictionaries by enabling values to be accessed by key, bunch["value_key"], or by an attribute, bunch.value_key.

2. Identify the keys in this dataset.
3. Print its target, target names, description and feature names. Explain the integer code used for each type of flower.
4. Print iris data and identify the type of iris data.
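
The tip in item 1 can be sketched as follows: the object returned by `load_iris()` behaves like a dictionary whose values can also be read as attributes. The choice of `target_names` as the example key is arbitrary.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# The same value is reachable by key and by attribute
print(type(iris))
print(iris['target_names'])
print(iris.target_names)
print(iris['target_names'] is iris.target_names)   # both refer to the same array
```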


In [None]:
from sklearn import datasets
iris = datasets.load_iris() #Loading the dataset
type(iris)

In [None]:
print(iris.DESCR)  # print() renders the line breaks in the description

In [None]:
iris.keys()

In [None]:
iris.target

In [None]:
iris.target_names

In [None]:
iris.feature_names

In [None]:
iris.data

In [None]:
type(iris.data)

<b>Based on the loaded Iris dataset, let's do the following:</b>

1. Save the iris data as a pandas DataFrame named `iris_data`, and name the attributes (columns) using the feature names obtained above. 
2. Next, add an extra column, the target (class label), as its fifth attribute and print the data matrix.
3. Identify the type of each attribute.
4. Use `head` and `tail` to inspect the first/last 5 rows
5. Compute basic statistics of this data
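
For step 2, the integer codes in `iris.target` can be turned into readable labels by indexing `iris.target_names` with them. This is a minimal sketch; the column name `'species'` is an assumption made here, not a name used elsewhere in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_data = pd.DataFrame(iris.data, columns=iris.feature_names)

# iris.target holds the codes 0, 1, 2; indexing target_names with them
# yields one species label per row
iris_data['species'] = iris.target_names[iris.target]

print(iris_data['species'].value_counts())
```

Keeping the labels as text makes group-wise summaries and plot legends easier to read than the bare integer codes.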

In [None]:
import pandas as pd
iris_data = pd.DataFrame(iris.data, columns = iris.feature_names)

In [None]:
iris_data['target'] = iris.target

In [None]:
iris_data

In [None]:
iris_data['target'].value_counts()

In [None]:
iris_data.info()

In [None]:
sepal_length= iris_data["sepal length (cm)"]
type(sepal_length)

In [None]:
iris_data.tail(5)

In [None]:
iris_data.describe()

In [None]:
print('sepal_length mean', sepal_length.mean())  # compute mean
print('sepal_length sd', sepal_length.std())  # standard deviation (sample, ddof=1)

In [None]:
# importing matplotlib library
import matplotlib.pyplot as plt

# plotting two attributes:
plt.plot(iris_data["sepal length (cm)"])
plt.plot(iris_data["sepal width (cm)"])
plt.title("Sepal Length  and Sepal Width")
plt.ylabel("Sepal Length / Sepal Width in cm")
plt.xlabel("Instances")


# Add a legend
plt.legend(["blue: sepal length", "orange: sepal width"], loc ="lower right")

plt.show()

In [None]:

# plotting a histogram
plt.hist(iris_data["sepal length (cm)"])
plt.show()

> The pandas plot extension can be used to make a scatterplot
- Display your plot with `plt.show()`

In [None]:
[iris_data.target]

In [None]:
iris_data.plot(kind="scatter", x="sepal length (cm)", y="sepal width (cm)", c="target", colormap="viridis", s=30)
plt.show()

In [None]:
iris_data.head()

In [None]:
iris_data.iloc[:, 0:4].boxplot()

### Loading local data from a machine
1. Loading an external dataset: load the CSV file located at `../data/populations.csv` as a DataFrame using the pandas function `read_csv` (`python -m pip install pandas`).

 Tips: 
 - Dataset: the folder `data` contains a file named `populations.csv`. Make sure it is in the right location. The data describes the populations of hares and lynxes (and carrots) in northern Canada over 20 years.

- A [DataFrame](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html#) is a 2-dimensional data structure that can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, an array or a table.

2. Compute its basic statistics.
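
A wrong relative path is the most common reason `read_csv` fails. The sketch below wraps the call in a check that reports the resolved path and the current working directory when the file is missing. `load_csv_checked` is a hypothetical helper introduced for illustration, not a pandas function.

```python
from pathlib import Path

import pandas as pd

def load_csv_checked(path):
    """Read a CSV with pandas, with a clearer error if the file is missing."""
    csv_path = Path(path).resolve()
    if not csv_path.is_file():
        raise FileNotFoundError(
            f'CSV file not found: {csv_path} (current directory: {Path.cwd()})'
        )
    return pd.read_csv(csv_path)

try:
    populations = load_csv_checked('../data/populations.csv')
    populations.info()
except FileNotFoundError as err:
    print(err)   # shows where Python actually looked for the file
```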

In [None]:
import pandas as pd
print(pd.__version__)  # show the installed pandas version
populations = pd.read_csv("../data/populations.csv")

populations.info()

In [None]:
populations.describe()

# Exercise:
1. Carry out a simple visualization of the data above.
2. Try loading a seaborn dataset and carry out descriptive statistics on it. 
    - TIPS: refer to the examples shown in: https://seaborn.pydata.org/tutorial/distributions.html
    - The list of datasets can be obtained from https://github.com/mwaskom/seaborn-data