***
# Brunei MYCE 2022 
## Workshop: Exploring Data with Python
## Date: 11th June 2022


***
# <ins>Workshop content</ins>

- Python
    - Isn't it just a snake?
- Common libraries for data exploration
    - pandas, numpy - for tabular data
    - numpy, scipy, scikit-learn - statistics, machine learning models
- Data wrangling
    - Example - Brunei's Chicken meat statistics
- Data exploration & visualisation
    - Basic statistics
- Modelling
    - trend analysis
- Looking back at the helicopter experiment
- What's next?

***
# ![Python image](https://www.python.org/static/img/python-logo.png)
Image source: [Python webpage](https://www.python.org)
- Taken from the official page
> "Python is a programming language that lets you work more quickly and integrate your systems more effectively."
- Open-source
- Top programming language for data scientists, according to this [Datacamp](https://www.datacamp.com/blog/top-programming-languages-for-data-scientists-in-2022) article
- Plenty of libraries to choose from

***
# Common Python libraries for data exploration
## Pandas https://pandas.pydata.org
> pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
- A common library for analysing structured tabular data, e.g. Excel files, SQL tables, etc.
    
## Numpy https://numpy.org
> The fundamental package for scientific computing with Python
- A common library for handling and manipulating arrays; pandas uses it in the background.

## SciPy https://scipy.org/
> Fundamental algorithms for scientific computing in Python
- Contains useful and optimized algorithms for modelling/analysing data

## Scikit-learn https://scikit-learn.org/
> - Simple and efficient tools for predictive data analysis
> - Accessible to everybody, and reusable in various contexts
> - Built on NumPy, SciPy, and matplotlib
> - Open source, commercially usable - BSD license
- Contains libraries for machine learning

## Other notable libraries for neural networks
- tensorflow https://www.tensorflow.org/
- keras https://keras.io/


***
# Common Python libraries for data visualisation
## Matplotlib https://matplotlib.org/
> Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Matplotlib makes easy things easy and hard things possible.
- A common library for 2D plotting, with some 3D plotting capabilities.
    
## Seaborn https://seaborn.pydata.org/
> Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- Produces more polished statistical charts than plain matplotlib, with less code.

***
# Data wrangling - with example
Data source: https://www.data.gov.bn/Pages/index.aspx

- Before we can explore the data, we need to read and reformat the data into something usable.
- This process is part of data engineering.
- For this workshop, we will be using data that is freely available from the data.gov.bn website.
- For this example, let's use the [Monthly Local Chicken Meat Production by District 2014-2020](https://www.data.gov.bn/Lists/dataset/mdisplay.aspx?ID=1009) dataset.
- We need to copy the link to the Excel file and store it in a variable.

In [None]:
data_url = 'https://www.data.gov.bn/Lists/dataset/Attachments/1009/Monthly%20Local%20Chicken%20Meat%20Production%20by%20District%20(2014-2020).xlsx'


- Since this is an Excel file, we will use pandas to read it.
- Don't forget to import the pandas module.
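
A minimal sketch of this step, assuming the workbook can be read directly with `pandas.read_excel` (which needs the optional `openpyxl` package for `.xlsx` files). The helper name `load_chicken_data` is hypothetical, and the sheet layout is not shown here, so arguments such as `sheet_name`, `header` or `skiprows` may need adjusting after inspecting the file.

```python
import pandas as pd

# Same link as the one stored in data_url above
data_url = (
    "https://www.data.gov.bn/Lists/dataset/Attachments/1009/"
    "Monthly%20Local%20Chicken%20Meat%20Production%20by%20District%20(2014-2020).xlsx"
)


def load_chicken_data(url: str = data_url) -> pd.DataFrame:
    """Download the Excel workbook and return its first sheet as a DataFrame."""
    # Hypothetical helper: read_excel reads the first sheet by default
    return pd.read_excel(url)


# Usage (requires internet access):
# df_raw = load_chicken_data()
# df_raw.head()
```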

- Let's view the data

- Here is the data engineering part

In [None]:
# Get the district list


- Congratulations.  You have successfully done the data preprocessing.

***
# Exploring the data... but first some theories
## Basic statistics
![box_normal](https://www.researchgate.net/publication/340996565/figure/fig3/AS:962186885210119@1606414638167/Box-plot-and-probability-density-function-of-a-normal-distribution.png)<p />
Image source (https://www.researchgate.net/figure/Box-plot-and-probability-density-function-of-a-normal-distribution_fig3_340996565)
## Some definitions
- ## mean
    - average value of the distribution
- ## median
    - the midpoint value in the distribution
    - for a normal distribution, **median** = **mean**
- ## σ (standard deviation)
    - a measure of data spread in the distribution
    - The diagram below shows how the data is spread in a normal distribution.
    - ![standard_dev](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Standard_deviation_diagram.svg/440px-Standard_deviation_diagram.svg.png)
    - Image source (https://upload.wikimedia.org/wikipedia/commons/thumb/8/8c/Standard_deviation_diagram.svg/440px-Standard_deviation_diagram.svg.png)
    - see also the 68-95-99.7 rule ([wiki](https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule))
- ## Q1 & Q3 (First and third quartile)
    - the values below which 25% (Q1) and 75% (Q3) of the data fall
- ## IQR (Inter-quartile range)
    - the distance between Q1 and Q3
- ## Outliers
    - data points that are greater than **Q3 + 1.5 * IQR** or less than **Q1 - 1.5 * IQR**
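
The definitions above can be sketched with numpy. The sample values below are made up purely for illustration and are not taken from the chicken meat dataset.

```python
import numpy as np

# Made-up sample, for illustration only
values = np.array([2, 4, 4, 5, 5, 6, 7, 8, 30], dtype=float)

mean = values.mean()                       # average value
median = np.median(values)                 # midpoint value
sigma = values.std(ddof=1)                 # sample standard deviation
q1, q3 = np.percentile(values, [25, 75])   # first and third quartiles
iqr = q3 - q1                              # inter-quartile range

# Outliers: points beyond 1.5 * IQR from the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"mean={mean:.2f}, median={median}, sigma={sigma:.2f}")
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, outliers={outliers}")
```

Note how the single large value pulls the mean well above the median, which is one reason to look at both.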


***
# Exploring the data
- Since we have the data in a pandas DataFrame, let's start investigating it with pandas.

In [None]:
# Display the data type via info()


In [None]:
# Try to get some statistics out via describe()


In [None]:
# Only the year column showed up; let's fix the other columns
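
By default `describe()` summarises only numeric columns, so a column that was read in as text is left out. One possible fix, sketched on a toy frame with made-up values, is to convert such columns with `pd.to_numeric`. `Retail_price_BND` is the column name used later in this notebook; `Year` and `Quantity_kg` are assumed names.

```python
import pandas as pd

# Toy frame with made-up values: numbers stored as text, as can happen
# after reading a spreadsheet
df = pd.DataFrame({
    "Year": [2019, 2019, 2020],
    "Quantity_kg": ["1200", "950", "n/a"],
    "Retail_price_BND": ["4800", "3900", "4100"],
})

for col in ["Quantity_kg", "Retail_price_BND"]:
    # errors="coerce" turns entries that cannot be parsed into NaN
    df[col] = pd.to_numeric(df[col], errors="coerce")

summary = df.describe()
print(summary)
```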


In [None]:
# Let's look at the unique values for district and month
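
Listing unique values is a quick way to spot inconsistent labels. The sketch below uses made-up entries and assumed column names (`District`, `Month`); the inconsistencies shown are hypothetical and may not occur in the real file.

```python
import pandas as pd

# Made-up labels with a trailing space and inconsistent capitalisation
df = pd.DataFrame({
    "District": ["Tutong", "Tutong ", "Belait", "belait"],
    "Month": ["Jan", "Jan", "Feb", "Feb"],
})

before = sorted(df["District"].unique())
print(before)  # four distinct labels for only two districts

# Normalise: strip whitespace, then use title case
df["District"] = df["District"].str.strip().str.title()
after = sorted(df["District"].unique())
print(after)
```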


In [None]:
# Let's look at the data visually using matplotlib


- There are a few issues with the data.
- However it is not easy to spot the problems.
- We can derive new features based on the existing data.  Hopefully this new feature can easily reveal the errors in the data.
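
As a sketch of the idea, a unit price can be derived by dividing the retail value by the quantity. The numbers are made up, the last row is deliberately wrong, and `Quantity_kg` and `Unit_price_BND_per_kg` are assumed column names.

```python
import pandas as pd

# Made-up values; the last retail price is deliberately wrong
df = pd.DataFrame({
    "Quantity_kg": [1000.0, 2000.0, 1500.0],
    "Retail_price_BND": [4000.0, 8000.0, 600000.0],
})

# Derived feature: price per kilogram
df["Unit_price_BND_per_kg"] = df["Retail_price_BND"] / df["Quantity_kg"]

# The suspicious row stands out much more clearly in the derived column
suspect_row = df["Unit_price_BND_per_kg"].idxmax()
print(df)
print("Row with the highest unit price:", suspect_row)
```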

In [None]:
# Create a new column to store the unit price per kg


In [None]:
# Try plotting the new feature now


- Looking at the new feature, what is wrong with the data?

- Now let's fix the problems

In [None]:
# Wrong Retail_price_BND for Tutong 2019 data


In [None]:
# Wrong units


In [None]:
# Recalculate unit price


In [None]:
# Replot to verify data is now correct


- Another thing we can do is look at the distribution of the data under various metrics.
- Let's use seaborn to display the information.
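
A minimal sketch of a distribution plot with seaborn, assuming a box plot is an acceptable way to show it. The values are made up, and `Year` and `Unit_price_BND_per_kg` are assumed column names.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up unit prices for two years
df = pd.DataFrame({
    "Year": [2018] * 4 + [2019] * 4,
    "Unit_price_BND_per_kg": [3.9, 4.0, 4.1, 4.2, 4.0, 4.1, 4.3, 4.4],
})

fig, ax = plt.subplots(figsize=(6, 4))
# One box per year: median, quartiles and outliers at a glance
sns.boxplot(data=df, x="Year", y="Unit_price_BND_per_kg", ax=ax)
ax.set_title("Unit price distribution per year (toy data)")
```

In a notebook the figure is displayed automatically; in a plain script, call `plt.show()` afterwards.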

In [None]:
# Use seaborn
# Plotting distribution of unit price over the years


In [None]:
# Plotting distribution of quantity per district


In [None]:
# Plotting distribution of quantity for Temburong only


***
# Data modelling
- Once you have explored the data and sorted out the issues, what do you do next?
- This will depend on the questions you have or the problem you want to solve.
- Typically, how you answer the question is through data modelling


## Q: Is the meat production increasing over the years, and by how much?

In [None]:
# let's plot the data
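
One way to approach this, sketched on made-up numbers, is to sum the production per year and plot the totals. `Year` and `Quantity_kg` are assumed column names.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up production records, two per year
df = pd.DataFrame({
    "Year": [2018, 2018, 2019, 2019, 2020, 2020],
    "Quantity_kg": [100.0, 120.0, 110.0, 130.0, 125.0, 135.0],
})

# Total production per year
yearly = df.groupby("Year")["Quantity_kg"].sum()

fig, ax = plt.subplots()
ax.plot(yearly.index, yearly.values, marker="o")
ax.set_xlabel("Year")
ax.set_ylabel("Total quantity (kg)")
ax.set_title("Yearly production (toy data)")
```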


- Is this enough to say yes?

## Making use of statistics & hypothesis test

Looking at the data visually is not enough to determine whether a trend is real.  You have to be sure that there is enough evidence in the data at hand.

Linear regression is one way to determine if the slope seen on the data is statistically significant.
- Linear regression is a straight line fitting algorithm.
- It provides the slope and intercept of the fitted line.
- It also provides some other statistics, such as the r-squared and the p-value.
    - r-squared indicates how well the y value can be predicted from x: 1 = perfectly predictable, 0 = not predictable
    - p-value indicates how statistically significant the fitted slope is; a smaller value means stronger evidence against a zero slope

![image.png](attachment:29d53b13-dc6c-422f-9261-6c1f5fafd757.png)

Below is our hypothesis for the problem.
1. null hypothesis => There is no slope
2. alternative hypothesis => There is a slope
3. p value cutoff = 0.05

Let's use linear regression to find the slope and the p-value via `scipy.stats.linregress`.
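
A minimal sketch of this test on two made-up series (not the chicken meat data): one with a clear upward trend and one that only fluctuates. The helper name `slope_test` is hypothetical. Note that `linregress` returns `rvalue` (r), so r-squared is obtained by squaring it, and its p-value tests the null hypothesis that the slope is zero.

```python
import numpy as np
from scipy import stats


def slope_test(x, y, alpha=0.05):
    """Fit a straight line and report whether the slope is significant."""
    result = stats.linregress(x, y)
    return {
        "slope": result.slope,
        "intercept": result.intercept,
        "r_squared": result.rvalue ** 2,
        "p_value": result.pvalue,
        "significant": bool(result.pvalue < alpha),
    }


x = np.arange(7)  # e.g. seven consecutive years
y_trend = np.array([10.0, 10.4, 11.1, 11.3, 12.2, 12.4, 13.1])  # made up
y_flat = np.array([5.0, 7.0, 4.0, 8.0, 5.0, 7.0, 5.0])          # made up

trend_fit = slope_test(x, y_trend)
flat_fit = slope_test(x, y_flat)
print(trend_fit)
print(flat_fit)
```

For the fluctuating series the p-value is far above the 0.05 cutoff, so the small fitted slope cannot be distinguished from no slope at all.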

- Because the p-value is greater than the cutoff value (0.05), we fail to reject the null hypothesis, i.e. the available data does not provide enough evidence of a slope.

***
# Looking back at the helicopter experiment

![image.png](attachment:47aebb97-8758-4b9b-88a5-39bdfdb1e342.png)

- Read the data
- reformat into something useful
- data exploration
- hypothesis testing

***
# What's Next

- So far, we have only one data source.
- Things are more interesting when you start to combine multiple sources of data.
    - How do you merge them?
    - What features can you derive from them?
    - Are they significant enough to solve the problem?
- There are a lot of statistical models you could use
    - ARIMA (autoregressive integrated moving average, [wiki](https://en.wikipedia.org/wiki/Autoregressive_integrated_moving_average)) can model time-series data, useful for forecasting.
    - Classification is useful for assigning data to groups
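
On the question of merging sources raised above, here is a minimal sketch using `DataFrame.merge`. Both tables are made up, and the second source (a yearly population table) is purely an assumption for illustration.

```python
import pandas as pd

# Made-up yearly production totals
production = pd.DataFrame({
    "Year": [2018, 2019, 2020],
    "Quantity_kg": [220.0, 240.0, 260.0],
})

# Hypothetical second source with a missing year
population = pd.DataFrame({
    "Year": [2018, 2019],
    "Population": [1000, 1010],
})

# A left join keeps every production row; indicator=True adds a
# "_merge" column showing which rows found a match
merged = production.merge(population, on="Year", how="left", indicator=True)

# Derived feature combining both sources
merged["Kg_per_person"] = merged["Quantity_kg"] / merged["Population"]
print(merged)
```

Checking the `_merge` column after a join is a cheap way to notice rows that did not find a partner in the other source.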

## Key takeaways
- Do not trust the data as it is.  Understand it and verify that it is correct.
- Learn from the data, and use the power of statistics to back it up.
