### Today's agenda

* Sample and Population
* Merits and Demerits of sampling
* Types of sampling
* Random sampling
    - Implementing the same using pandas
* Predictive analytics
* Importance of sampling in PA

<h1><center>Sample & Population - Predictive Analytics</center></h1>

### Population

* A set of similar items or events which is of interest for some question or experiment.
* We denote the population size as `N`.

### Sample

* A subset of the population (a statistical sample) that is chosen to represent the population.
* We denote the sample size as `n`.

### Sampling (method)

* The selection of a subset of individuals from a statistical population in order to estimate characteristics of the whole population.
* It is a technique that everyone applies in day-to-day activities.

<img src="http://1.bp.blogspot.com/-R2L1EbgRwIs/UpEKoV7_6tI/AAAAAAAABAg/x6B-knDyoEI/s1600/population.gif">

<br>

**Credits** - Image from Internet

### Note

* By studying a sample, statisticians infer characteristics (estimates) of the whole population.

### Example

* Imagine you have a piece of land and you want to know if the land is fertile enough to grow plants.

<br>

<img src="https://external-content.duckduckgo.com/iu/?u=http%3A%2F%2Fwiki.urbandead.com%2Fimages%2F0%2F0a%2FWaste_land.JPG">

<br>

* **Scenario 1**
    - Interpret the land's fertility by testing the whole land.
* **Scenario 2**
    - Interpret the land's fertility by just testing a sample (soil) in a container or jar.

<br>

**Credits** - Image from Internet

### Merits & Demerits

**Merits**

* Cost effective
* Time saving
* Often higher accuracy, since a smaller set can be measured more carefully

**Demerits**

* Risk of bias
* Requires subject-specific knowledge

### Types of Sampling

1. **Probability Sampling**
    - `Simple Random Sampling`
        - A randomly selected subset in which every member of the population has an equal chance of being selected.
        - From the selected random sample, the researcher makes statistical inferences about the whole population.
    - Systematic Sampling
    - Cluster Sampling
    - Stratified Sampling
2. **Non-Probability Sampling**
    - Convenience Sampling
    - Judgmental Sampling
    - Snowball Sampling
    - Quota Sampling
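
As a rough illustration of how two of the other probability methods differ from simple random sampling, the sketch below uses pandas on a made-up population. The `region` column, the step size `k`, and the 10% fraction are assumptions chosen only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical population: 1000 rows, each belonging to one of three regions.
rng = np.random.default_rng(0)
population = pd.DataFrame({
    "value": rng.integers(10, 100, size=1000),
    "region": rng.choice(["north", "south", "east"], size=1000),
})

# Systematic sampling: pick a random start, then take every k-th row.
k = 10
start = int(rng.integers(0, k))
systematic_sample = population.iloc[start::k]

# Stratified sampling: draw 10% at random from within each region,
# so every region is represented roughly in proportion to its size.
stratified_sample = population.groupby("region").sample(frac=0.10, random_state=0)

print(len(systematic_sample))                      # 100 rows
print(stratified_sample["region"].value_counts())  # about 10% of each region
```

Systematic sampling relies on the row order carrying no hidden pattern, while stratified sampling needs a grouping column that is known for every member of the population.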

In [1]:
import pandas as pd
import numpy as np

### Population data

Get random integers between `low` (inclusive) and `high` (exclusive)

* size → (how_many_rows, how_many_columns) - (1000, 3)

Make random data using NumPy

In [2]:
# rand_data (population)
rand_data = np.random.randint(low=10, high=100, size=(1000, 3))

In [3]:
# display rand_data
rand_data

array([[37, 69, 58],
       [84, 32, 88],
       [73, 62, 99],
       ...,
       [94, 19, 13],
       [59, 13, 21],
       [51, 60, 97]])
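
Because no seed is set, the values above change on every run. One way to make the population reproducible is NumPy's `default_rng` generator, sketched below. The seed value `42` and the name `seeded_data` are arbitrary assumptions, and the numbers produced will differ from the output shown above.

```python
import numpy as np

# Seeded generator: the same seed yields the same "population" on every run.
rng = np.random.default_rng(seed=42)

# Same bounds and shape as rand_data: 1000 rows, 3 columns.
seeded_data = rng.integers(low=10, high=100, size=(1000, 3))

print(seeded_data.shape)                     # (1000, 3)
print(seeded_data.min(), seeded_data.max())  # both lie between 10 and 99
```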

Create a dataframe from the generated data and name its columns

In [4]:
# df
df = pd.DataFrame(data=rand_data, columns=['col_x', 'col_y', 'col_z'])

In [5]:
# head()
df.head()

| | col_x | col_y | col_z |
|---|---|---|---|
| 0 | 37 | 69 | 58 |
| 1 | 84 | 32 | 88 |
| 2 | 73 | 62 | 99 |
| 3 | 29 | 81 | 96 |
| 4 | 14 | 44 | 28 |


Population data (df) size is 1000

* N = 1000

In [6]:
# shape
df.shape

(1000, 3)

### Simple random sample

* Select a sample dataframe from population (df) of size 100
* n = 100

In [7]:
# rand_sample_df: draw 100 rows at random; random_state makes the draw reproducible

rand_sample_df = df.sample(n=100, random_state=2)

In [8]:
# shape
rand_sample_df.shape

(100, 3)

In [9]:
# head
rand_sample_df.head()

| | col_x | col_y | col_z |
|---|---|---|---|
| 37 | 60 | 58 | 60 |
| 726 | 53 | 71 | 18 |
| 846 | 11 | 84 | 31 |
| 295 | 81 | 28 | 30 |
| 924 | 35 | 90 | 21 |


Alternatively, `frac` samples a fraction of the population instead of a fixed number of rows

In [10]:
# frac: sample 40% of the rows (no random_state, so the rows differ on every run)
# help(df.sample)

rand_sample_df_f = df.sample(frac=0.40)

In [11]:
# head
rand_sample_df_f.head()

| | col_x | col_y | col_z |
|---|---|---|---|
| 949 | 77 | 73 | 57 |
| 683 | 25 | 87 | 56 |
| 190 | 98 | 23 | 68 |
| 640 | 56 | 10 | 55 |
| 307 | 56 | 62 | 80 |


In [12]:
# shape
rand_sample_df_f.shape

(400, 3)
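
To see how well a sample stands in for the population, one option is to compare summary statistics of the two. The sketch below rebuilds a population like `df` with an arbitrary seed (an assumption, so its values differ from the outputs above) and compares column means. The names `population`, `sample`, and `comparison` are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild a population like the one above (seed chosen arbitrarily).
rng = np.random.default_rng(0)
population = pd.DataFrame(
    rng.integers(10, 100, size=(1000, 3)),
    columns=["col_x", "col_y", "col_z"],
)

# Simple random sample of n = 100 rows out of N = 1000.
sample = population.sample(n=100, random_state=2)

# Side-by-side column means; the gap between them is the sampling error.
comparison = pd.DataFrame({
    "population_mean": population.mean(),
    "sample_mean": sample.mean(),
})
comparison["abs_error"] = (
    comparison["population_mean"] - comparison["sample_mean"]
).abs()
print(comparison)
```

If the sample is drawn at random, the sample means should land close to the population means, and the gap tends to shrink as `n` grows.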

In [13]:
# help(df.sample)

<center><h1>Predictive Analytics</h1></center>

Predictive analytics encompasses a variety of statistical techniques from `data mining`, `predictive modelling`, and `machine learning`, that analyze current and historical facts to make predictions about future or otherwise unknown events.

<br>

<img src="http://bigdata-madesimple.com/wp-content/uploads/2016/09/Predictive-analytics.png">

<br>

**Credits** - Image from http://bigdata-madesimple.com/

### Machine Learning

* ML is a technique that lets a computer learn from previous experience (data) in order to make predictions about future outcomes.
* It can learn from and adapt to new data without being explicitly reprogrammed.
* A model needs prior training on existing data before it can be tested on new data.

### What is this???

<img src="https://external-content.duckduckgo.com/iu/?u=http%3A%2F%2Fwww.webindia123.com%2Fpets%2Fcats%2Famerican.jpg">

### ML and Traditional Programming

* **Traditional Programming** → Inputs are known, the programmer writes the logic to obtain the output.

<img src="https://raw.githubusercontent.com/msameeruddin/Data-Analysis-Python/main/6_DA_Sample_PA/traditional_programming.png">

<br>

* **Machine Learning** → Inputs and outputs are known, the algorithm tries to derive its own logic to map the inputs to the outputs.

<img src="https://raw.githubusercontent.com/msameeruddin/Data-Analysis-Python/main/6_DA_Sample_PA/ml_programming.png">

<br>

**Images by Author**
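
The contrast can be sketched with a toy problem, Celsius to Fahrenheit conversion. This is only an illustration under simple assumptions: the "learning" here is a least-squares line fit with `np.polyfit`, and the function name and data points are made up for the example.

```python
import numpy as np

# Traditional programming: the programmer writes the rule by hand.
def celsius_to_fahrenheit(celsius):
    return celsius * 1.8 + 32

# Machine learning (in miniature): only inputs and outputs are given,
# and a least-squares line fit recovers the rule from the examples.
inputs = np.array([-10.0, 0.0, 15.0, 25.0, 40.0])
outputs = celsius_to_fahrenheit(inputs)  # known outputs for the known inputs

slope, intercept = np.polyfit(inputs, outputs, deg=1)
print(round(slope, 2), round(intercept, 2))  # 1.8 32.0

# The learned rule can now be applied to an unseen input.
print(round(slope * 100 + intercept, 1))     # 212.0
```

Real problems rarely have such a clean rule, which is why the learned mapping is usually an approximation that has to be evaluated on data it has not seen.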

### ML and Mathematics

* ML is just the tip of the iceberg.
* The math and Python code (algorithms) holding up the iceberg are what we should understand.

<br>

<img src="https://external-content.duckduckgo.com/iu/?u=http%3A%2F%2Fwww.heathertforbes.com%2Fenewsletter%2Fimages%2Ficeberg.png&f=1&nofb=1">

<br>

**Credits** - Image from Internet

### Examples

* Email spam detector
* Auto-completion in email
* Google Photos image classification
* Weather forecasting - Time series prediction
* ...

### Types of ML

* **Supervised Learning**
    - The computer is presented with both example inputs and their respective outputs. The algorithm learns a general rule to map the inputs to the outputs.
* **Unsupervised Learning**
    - No outputs are given to the learning algorithm. The algorithm alone has to figure out the structure in the inputs and find the hidden patterns.
* **Reinforcement Learning**
    - The algorithm learns through a reward system, and its goal is to maximize the total reward.

### How much data do you really need for building a predictive model?

We are often told that building a machine learning predictive model requires large amounts of data. Whether that is true depends on the situation.

* Effective sampling is about maximizing the amount of information about the whole population obtained from each sampling unit.
* A small random probability sample, as long as it is truly random and not biased in any way, can have very high predictive power.
* You can achieve more with less.
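
A small experiment can make this concrete. The sketch below is an illustration on synthetic data, not a general proof: the linear relation, the noise level, the 1% fraction, and all variable names are assumptions. It fits the same straight line on a full population and on a small random sample, then prints both sets of coefficients.

```python
import numpy as np
import pandas as pd

# Hypothetical population of 100,000 records with a noisy linear relation.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100_000)
y = 3.0 * x + 5.0 + rng.normal(0, 2, size=100_000)
population = pd.DataFrame({"x": x, "y": y})

# Fit a straight line on the full population and on a 1% random sample.
sample = population.sample(frac=0.01, random_state=1)
full_slope, full_intercept = np.polyfit(population["x"], population["y"], deg=1)
sample_slope, sample_intercept = np.polyfit(sample["x"], sample["y"], deg=1)

print(f"full data : slope={full_slope:.2f}, intercept={full_intercept:.2f}")
print(f"1% sample : slope={sample_slope:.2f}, intercept={sample_intercept:.2f}")
```

With a simple relation like this one, the model fitted on 1,000 random rows should come out close to the one fitted on all 100,000. More complex models or biased samples may need considerably more data.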

**More information** → https://www.sv-europe.com/blog/predictive-analytics-much-data-really-need/