<img style="display: block; margin: 0 auto" src="https://images.squarespace-cdn.com/content/v1/645a878d9740963714b8f343/3efb24e3-9fb9-4bc7-b41e-7f36742ae747/2-2.jpg?format=1500w" alt="Lonely Octopus Logo">

**Please create a copy of the notebook in your gdrive to be able to edit it.**

**You can make a copy from the menu: File > Save a copy in Drive**

# Agricultural Sampling Frames

Your Ministry of Agriculture is establishing a multiyear regional development plan that aims to support farmers financially. In the upcoming years they will need to carry out different field studies and analyses that impact hundreds of farmers. For example, they might decide to sponsor only one mode of irrigation and provide improved seeds for some selected crops. It is very costly to conduct an exhaustive census at different points in the development plan, which is where we need your help.

As an Analyst in the Department of Strategy and Statistics, you have to decide which method we should use to sample farmers. This is very important because the method you choose will be applied multiple times by the data collection team whenever they are in the field, and we need to capture the whole agricultural situation of the region concerned.

#### **Sampling and Inference**
During our last census, we collected data on 606 farmers who specialize in growing vegetables. You're tasked with drawing samples from this dataset using 5 different methods and comparing them against each other and against the exhaustive dataset:

1. Simple Random Sampling
2. Systematic Sampling
3. Replicated Sampling
4. Probability Proportional to Size Sampling
5. Stratified Sampling

<img src='https://hotcubator.com.au/wp-content/uploads/2020/07/Copy-of-Social-Business-1.png'> <br>

Be warned! This analysis will be rather lengthy, detailed and, in places, repetitive, because it is a real project! Sampling is a very practical skill, and picking a good sampling method is a common problem relevant to almost all fields and industries. We're counting on you to help us choose an adequate method!




#### **Get to know the data!**

The Dataset contains 6 variables, among which 5 are categorical (qualitative) and 1 is numerical (quantitative):

1. **Production mode:**
  1. **Primary:** The same crop is planted year after year
  2. **In succession:** A different crop is planted each year (or each cycle)
  3. **In association:** Multiple crops are planted on the same field at the same time
  4. **Understory:** Crop is planted under trees (Can be in a forest)

2. **Irrigation:**
  1.   **Yes:** The farmer uses a water source besides rain
  2.   **No:** The farmer is strictly relying on rain (Pluvial or Rainfed Irrigation mode)
3. **Irrigation mode:**

  1.   **Localized:** Farmer uses a drip irrigation system (continuous drops of water). This method is the most efficient in terms of water usage.
  2.   **Gravity:** An open-air canal linking the field to the water source (e.g., a river). This irrigation system uses a lot of water, especially since much of it is lost on the way through absorption into the ground and evaporation.
  3. **Aspersion:** Water is brought to the plants in the form of artificial rain using sprinklers fixed across the field.
  4. **Pivot:** A mobile system that pumps water from a source into a long elevated pipe fitted with sprinklers that spans the entire field. This pipe moves from one side to the other and irrigates the whole field.
  5. **Gravity, Localized:** Mixed
  6. **Localized, Pivot:** Mixed

4. **Crop:**
  1. Tomatoes
  2. Potatoes
  3. ...
5. **Greenhouse:**
  1. **No:** Not used
  2. **Small tunel:** A tunnel that is narrow but long
  3. **Big Tunel:** A tunnel that is wide but short
  4. **Canarian:** A structure made of only wires and films
  5. **Multi-chapel:** Greenhouse with more than one chapel (curve)

6. **Field area:**
  - Planted field area in Ha (Hectares)




In [None]:
# Import necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# import altair as alt (Graphs look better in quality than matplotlib)
import random
import math

In [None]:
# Import your data.
# Add an argument specifying the index column so it doesn't get treated as a numerical variable
df=pd.read_excel('Sampling Frame dataset.xlsx')

In [None]:
# Check variable types and null values
# Read the variable descriptions well and fill the null values adequately

Describe your numerical variable (Field_area)

In [None]:
# What is the average, minimum and maximum area planted ?
# Is the distribution of the area planted equal across the whole population ?
# What is the percentage of farmers holding 80% of the total area planted ?

# Visualize your findings
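
One possible way to approach the 80% question is to sort farmers from largest to smallest area and accumulate their share of the total. The sketch below uses a made-up toy Series; the helper name `share_of_farmers_holding` and the values are hypothetical and not part of the census data.

```python
import pandas as pd

def share_of_farmers_holding(areas: pd.Series, target: float = 0.80) -> float:
    """Smallest fraction of farmers (largest first) whose combined area reaches `target` of the total."""
    sorted_areas = areas.sort_values(ascending=False).reset_index(drop=True)
    cumulative_share = sorted_areas.cumsum() / sorted_areas.sum()
    # Position of the first farmer at which the cumulative share reaches the target
    n_needed = int((cumulative_share >= target).to_numpy().argmax()) + 1
    return n_needed / len(sorted_areas)

# Hypothetical areas in hectares, for illustration only
toy_areas = pd.Series([50.0, 30.0, 10.0, 5.0, 3.0, 2.0])
fraction = share_of_farmers_holding(toy_areas)
print(f"{fraction:.1%} of farmers hold 80% of the area")
```

On the real data you would pass `df.Field_area` instead of the toy Series.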

Describe your categorical variables

In [None]:
# How many times does each category occur in each variable ?
# What percentage of its variable does each category represent ?

# Visualize your findings

Dig deeper into the data

In [None]:
# Which crop has the highest Field area ?
# Are the numbers of farmers planting the same crops correlated to the sum of their planted area ?
# Is the Production mode used the same for the same crops ?
# Is the Irrigation mode used the same for the same crop ?
# Who uses greenhouses and who doesn't ?
# Which crops are relying only on rain ?
# Does the Irrigation mode impact the planted area ?



# Visualize these findings. Unleash your imagination to come up with more interesting questions

# Sampling

Determining an appropriate sample size is a detailed topic in statistics in its own right. If you would like to choose your own sample size, please do. <br>
To start sampling immediately, please use **200 observations** for all samples.

Create functions for each sampling method. <br>
If it's easier to sample without using functions you can also do that

In [None]:
# Simple Random Sampling
def random_sampling(df, vector):
    ... # Fill me in!

# Systematic Sampling
def systematic_sampling(df, step):
    ... # Fill me in!

# Replicated Sampling
def replicated_sampling(df, step, vector):
    ... # Fill me in!


# Probability Proportional to Size Sampling
def pps_sampling():
    ... # Fill me in!

# Stratified Sampling
def stratified_sampling():
    ... # Fill me in!

**Simple Random Sampling:**
This takes a totally random sample from the population. You should run this at least 3 times and compare the 3 against each other. You can also shuffle and sort the population as you run this sample. If you get consistent results it's a good sign.
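
A minimal sketch of simple random sampling with pandas, run three times with different seeds. The function name `random_sampling_sketch`, the `seed` parameter and the toy dataframe are assumptions made for illustration; they are not the signature from the skeleton above.

```python
import pandas as pd

def random_sampling_sketch(df: pd.DataFrame, n: int, seed: int) -> pd.DataFrame:
    # Every row has the same chance of selection; rows are drawn without replacement
    return df.sample(n=n, random_state=seed)

# Toy stand-in for the 606-farmer population (values are made up)
toy = pd.DataFrame({
    "Crop": ["Tomatoes", "Potatoes"] * 303,
    "Field_area": list(range(1, 607)),
})

# Three independent samples, then a quick consistency check on the mean
samples = [random_sampling_sketch(toy, n=200, seed=seed) for seed in (1, 2, 3)]
for sample in samples:
    print(round(sample["Field_area"].mean(), 2))
```

If the three means (and the other statistics you compare) stay close to each other, the method is behaving consistently.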

**Systematic Sampling:**
This follows a fixed *step* while sampling, so the selected rows are equally spaced.
$step =  \frac{N}{n}$ where $N$ is the Population size and $n$ the sample size. If $\frac{N}{n}$ is not a whole number (here $\frac{606}{200} = 3.03$), round it down to an integer step. <br>
Implement this at least twice with different starting rows; for example, with a step of 4, take rows (0,4,8,...) and then rows (1,5,9,...). You can also shuffle and sort as mentioned above.
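
A minimal sketch of systematic sampling using positional slicing. It assumes an integer step (rounded down) and simply truncates any extra rows at the end, which is a simplification; the function name and toy dataframe are hypothetical.

```python
import pandas as pd

def systematic_sampling_sketch(df: pd.DataFrame, n: int, start: int = 0) -> pd.DataFrame:
    step = len(df) // n  # e.g. 606 // 200 = 3
    # Take every `step`-th row from `start`, then keep only the first n rows
    return df.iloc[start::step].head(n)

toy = pd.DataFrame({"Field_area": list(range(1, 607))})  # made-up values

first_pass = systematic_sampling_sketch(toy, n=200, start=0)
second_pass = systematic_sampling_sketch(toy, n=200, start=1)
print(list(first_pass.index[:3]), list(second_pass.index[:3]))
```

Because the result depends on row order, shuffling or sorting the dataframe before sampling gives you additional variants to compare.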

**Replicated Sampling:**
This combines both Random and Systematic sampling. Take equal samples of 100 each using each method and combine them. Do NOT take unequal samples, that's for the next sampling type.
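
One way this combination could look is sketched below. It assumes that the random half is drawn from the rows not already picked by the systematic half, so that no farmer appears twice; the text above does not specify this, so treat it as a design choice. The function name and toy data are hypothetical.

```python
import pandas as pd

def replicated_sampling_sketch(df: pd.DataFrame, n_each: int, seed: int) -> pd.DataFrame:
    # Systematic half: every `step`-th row, truncated to n_each rows
    step = len(df) // n_each
    systematic_part = df.iloc[::step].head(n_each)
    # Assumption: draw the random half only from rows not selected yet
    remaining = df.drop(systematic_part.index)
    random_part = remaining.sample(n=n_each, random_state=seed)
    return pd.concat([systematic_part, random_part])

toy = pd.DataFrame({"Field_area": list(range(1, 607))})  # made-up values
replicated = replicated_sampling_sketch(toy, n_each=100, seed=0)
print(len(replicated))
```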

**Stratified Sampling:**
A bit more advanced. With this method you choose a measure of size for your numerical variable. We recommend using the median. You then split the population into two subsets: one containing the rows lower than the median and the other the rows greater than or equal to the median. You will take a sample from each subset following this formula:

$ns_{h} = n \frac{C_{h} X_h^q}{\sum_{h}C_{h} X_h^q}$  with $q=\frac{1}{2}$ and $C_{h}=\frac{S_{h}}{\bar{Y_{h}}}$

Where:
- $h$ = ID of the data subset (1 or 2) <br>
- $n$ = Sample size (200) <br>
- $S_h$ = Standard deviation of your numerical variable in that specific data subset (careful, you will have two: $S_1$ and $S_2$) <br>
- $\bar{Y}_h$ = Mean of your numerical variable in that data subset (you will have two: $\bar{Y}_1$ and $\bar{Y}_2$) <br>
- $X_h$ = Median of your numerical variable in that data subset (you will have two: $X_1$ and $X_2$) <br>

In the end you will have two new "mini sample sizes" $ns_1$ and $ns_2$ that add up to $n$. Extract a random sample from each subset using its respective mini sample size and combine both in one dataframe. Your stratified sample is now ready !
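
The allocation formula above could be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, the toy areas are randomly generated, pandas' default sample standard deviation is used for $S_h$, and the first mini sample size is rounded with the second taken as the remainder so that they add up to $n$.

```python
import numpy as np
import pandas as pd

def stratified_sampling_sketch(df, size_col, n, seed, q=0.5):
    median = df[size_col].median()
    strata = [df[df[size_col] < median], df[df[size_col] >= median]]
    # C_h * X_h^q for each stratum, with C_h = S_h / mean_h and X_h the stratum median
    scores = np.array([
        (s[size_col].std() / s[size_col].mean()) * s[size_col].median() ** q
        for s in strata
    ])
    ns_1 = int(round(n * scores[0] / scores.sum()))
    ns_2 = n - ns_1  # remainder, so the two sizes always add up to n
    parts = [s.sample(n=k, random_state=seed) for s, k in zip(strata, (ns_1, ns_2))]
    return pd.concat(parts), (ns_1, ns_2)

# Made-up skewed areas standing in for Field_area
rng = np.random.default_rng(0)
toy = pd.DataFrame({"Field_area": rng.lognormal(mean=0.0, sigma=1.0, size=606)})

stratified, (ns_1, ns_2) = stratified_sampling_sketch(toy, "Field_area", n=200, seed=0)
print(ns_1, ns_2, len(stratified))
```

Keep `ns_1` and `ns_2` around: you will need them again when you scale the sample back up to the population.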

**Probability Proportional to Size Sampling (PPS):**
This one is the most advanced (more than stratified) and cannot be explained in one page. We count on you to refer to this <a href="https://cdn.who.int/media/docs/default-source/hq-tuberculosis/global-task-force-on-tb-impact-measurement/meetings/2008-03/p20_probability_proportional_to_size.pdf?sfvrsn=51372782_3">document</a>  to grasp the concept and implement it !
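
As a starting point, here is a sketch of one common variant, systematic PPS: accumulate the size variable, compute a sampling interval, pick a random start, and select the unit whose cumulative range contains each selection point. It is an illustration with a continuous random start and made-up data, not a transcription of the linked document; the function name is hypothetical. A farmer whose area exceeds the interval can be selected more than once.

```python
import numpy as np
import pandas as pd

def pps_systematic_sketch(df: pd.DataFrame, size_col: str, n: int, seed: int) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    cumulative = df[size_col].cumsum().to_numpy()
    interval = cumulative[-1] / n          # sampling interval
    start = rng.uniform(0, interval)       # random start within the first interval
    points = start + interval * np.arange(n)
    # Each selection point falls inside exactly one farmer's cumulative range
    positions = np.searchsorted(cumulative, points, side="right")
    positions = np.minimum(positions, len(df) - 1)  # guard against float rounding
    return df.iloc[positions]

toy = pd.DataFrame({"Field_area": list(range(1, 607))})  # made-up values
pps = pps_systematic_sketch(toy, "Field_area", n=200, seed=0)
print(len(pps), round(pps["Field_area"].mean(), 1), toy["Field_area"].mean())
```

The sample mean is expected to be higher than the population mean, because larger fields are more likely to be selected. This is why PPS needs its own inference weights.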


**Analyze all samples**

- Were the results consistent when you implemented random sampling 3 times ? For example, did your numerical variable still follow the same distribution ?
- Were any crops nonexistent in the samples ?

- Plot 6 distributions of your numerical variable next to each other or on the same graph: one for each of the 5 sampling frames and one for the population.

- Create a table comparing the characteristics of your numerical variable in each sample against the population. For example, divide each sample's mean, median and standard deviation by the population mean, median and standard deviation. The closer the ratio is to 1, the better the sample.

- Plot the percentage each crop represents in the dataset 6 times: once for each of the 5 sampling frames and once for the population, as mentioned above.
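
The comparison table of ratios could be built along these lines. The helper name `compare_to_population` and the toy data are hypothetical; on the real data you would pass `df.Field_area` and the `Field_area` column of each sample.

```python
import pandas as pd

def compare_to_population(population: pd.Series, samples: dict) -> pd.DataFrame:
    stats = ["mean", "median", "std"]
    population_stats = population.agg(stats)
    # One row per sample: each statistic divided by the population's statistic
    rows = {name: values.agg(stats) / population_stats for name, values in samples.items()}
    return pd.DataFrame(rows).T

population = pd.Series(range(1, 607), dtype=float)  # made-up values
ratio_table = compare_to_population(population, {
    "population": population,
    "random": population.sample(n=200, random_state=0),
    "systematic": population.iloc[::3].head(200),
})
print(ratio_table.round(3))
```

Rows with all three ratios close to 1 correspond to samples that reproduce the population well for this variable.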

**Answer the same questions you have answered earlier to analyze the samples and compare all the results. It is a lengthy exercise but extremely important. Determine which sampling method displays an accurate representation of the population.**



# Inference

Inference, by definition, is straightforward: estimate the population from each sample. For clarity, here is an example:

Since we took a random sample of $200$ farmers from the population, if there are, for example, $100$ farmers planting potatoes in the sample, we would estimate $100 \times \frac{606}{200} = 303$ in the population, where 606 is the population size and 200 the sample size.

In [None]:
P = len(df)/len(df_random)
Estimated_Population_Crop = df_random.Crop.value_counts() * P

- Carry out this same operation for all variables in all of the samples you took and compare them against the original population.

- Note: Random, Systematic and Replicated sampling use the same inference method (multiply by a weight of $\frac{N}{n}$, the inverse of the sampling probability $\frac{n}{N}$, as mentioned above). This means that all farmers have the same weight in the population. For Stratified and PPS sampling you **CANNOT** use the same method; you will use at least 2 different weights to form a population of the same size as the original. **Hint:** Try to exactly reverse-engineer the sampling process using the same mini sample sizes you used.
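
For the stratified case, one way to read the hint is to give each sampled farmer a weight equal to the size of their subset divided by the mini sample size drawn from it, then sum the weights per category. The sketch below uses hypothetical subset sizes (400 and 206) and made-up rows purely for illustration.

```python
import pandas as pd

def estimate_counts(sample: pd.DataFrame, column: str, weight_col: str = "weight") -> pd.Series:
    # Each sampled farmer stands for `weight` farmers in the population
    return sample.groupby(column)[weight_col].sum()

# Made-up stratified sample: 100 farmers from each of two subsets
sample = pd.DataFrame({
    "Crop": ["Tomatoes", "Potatoes"] * 100,
    "stratum": ["low"] * 100 + ["high"] * 100,
})
stratum_population = {"low": 400, "high": 206}      # hypothetical subset sizes N_h
stratum_sample = sample["stratum"].value_counts()   # mini sample sizes n_h
sample["weight"] = (
    sample["stratum"].map(stratum_population) / sample["stratum"].map(stratum_sample)
)

estimated = estimate_counts(sample, "Crop")
print(estimated, estimated.sum())
```

The weights add up to the population size (606 in this toy setup), which is a useful sanity check on any weighting scheme.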

# Dimensionality Reduction & Factor Analysis

**Note:** The following sections can be implemented with Python, but R may be even more suitable. By the way, the Python vs. R debate is a false one! Python and R are good for different things, and data analysts and data scientists often use both.

Before you start, download [this file](https://drive.google.com/file/d/1VEUl4AqmvgqtPGGcnMxXE6E81tvNn3C4/view?usp=share_link) and read the entire analysis with interpretations. It's a simple analysis in R. If you will be using Python, the goal is not to understand the syntax but the **graphs and conclusions** associated with each one.

- Use **Multiple Correspondence Analysis (MCA)** to plot all 6 variables on a 2D plot. This is called Dimensionality reduction and it allows us to group individuals with similar profiles and check associations between variable categories.


In [None]:
# Convert your numerical variable (Field_area) to a categorical variable
# Create intervals based on the variable's distribution

V = []
for i in range(len(df.Field_area)):
  # X and Y are placeholders for the interval bounds you choose
  if X <= df.Field_area[i] < Y:
    ... # Fill me in!
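
As an alternative to an explicit loop, pandas can assign intervals directly with `pd.cut`. The bin edges and labels below are hypothetical; choose yours from the actual distribution of `Field_area`.

```python
import pandas as pd

areas = pd.Series([0.2, 0.8, 1.5, 4.0, 12.0, 35.0])  # made-up hectares

bins = [0, 1, 5, float("inf")]          # hypothetical interval bounds
labels = ["small", "medium", "large"]   # hypothetical category names

# right=False makes each interval include its lower bound: [0, 1), [1, 5), [5, inf)
area_class = pd.cut(areas, bins=bins, labels=labels, right=False)
print(area_class.tolist())
```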

- Plot the MCA graph and be **descriptive** in your interpretations ! Which variables are closest to each other ? Which variable categories are closest to each other ? What does it mean ? For example, are farmers more likely to use the same production and irrigation modes if they plant the same or similar crops ?
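
If you stay in Python, the MCA coordinates can be obtained as a correspondence analysis of the indicator (one-hot) matrix. The sketch below does this with NumPy only; it is a bare-bones illustration on made-up data, the function name `mca_coordinates` is hypothetical, and a dedicated MCA package or the R tutorial may apply additional corrections to the eigenvalues.

```python
import numpy as np
import pandas as pd

def mca_coordinates(categorical_df: pd.DataFrame, n_components: int = 2):
    # Indicator matrix: one column per variable category
    Z = pd.get_dummies(categorical_df.astype(str)).astype(float)
    P = Z.to_numpy() / Z.to_numpy().sum()
    r = P.sum(axis=1)  # row masses (individuals)
    c = P.sum(axis=0)  # column masses (categories)
    # Standardized residuals, then singular value decomposition
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    k = n_components
    dims = [f"Dim{i + 1}" for i in range(k)]
    # Principal coordinates of individuals and of categories
    rows = pd.DataFrame((U[:, :k] * sigma[:k]) / np.sqrt(r)[:, None],
                        index=categorical_df.index, columns=dims)
    cols = pd.DataFrame((Vt[:k].T * sigma[:k]) / np.sqrt(c)[:, None],
                        index=Z.columns, columns=dims)
    return rows, cols

toy = pd.DataFrame({
    "Irrigation": ["Yes", "Yes", "No", "No", "Yes", "No"],
    "Crop": ["Tomatoes", "Potatoes", "Tomatoes", "Onions", "Potatoes", "Onions"],
})
individuals, categories = mca_coordinates(toy)
print(categories.round(3))
```

Plotting `categories` as a scatter plot with labels gives the category map; categories that sit close together tend to occur in the same individuals.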

# Clustering

Start with **Ascending Hierarchical Clustering (AHC) to cluster individual farmers**. Your goal is to cluster farmers in similar groups using a **Dendrogram**. Please refer to the tutorial above for context. You will use the Individuals' coordinates on the MCA's axes to plot this.

Example of Dendrogram clustering US states from online.visual-paradigm.com. Your graph should end with 606 leaves (ends of Dendrogram) ![image.png](https://online.visual-paradigm.com/repository/images/0fe81efd-c6f6-41af-98d5-d9b1f0d33f2f.png)
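
A minimal sketch of hierarchical clustering with SciPy on made-up 2D coordinates, standing in for the individuals' MCA coordinates. Ward linkage and the choice of 2 clusters are assumptions for this toy example; on the real data you would read the number of clusters off the Dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Two well-separated made-up groups of 5 points each
rng = np.random.default_rng(0)
coords = np.vstack([
    rng.normal(0, 0.1, size=(5, 2)),
    rng.normal(5, 0.1, size=(5, 2)),
])

Z = linkage(coords, method="ward")        # bottom-up (ascending) merge tree
tree = dendrogram(Z, no_plot=True)        # drop no_plot=True to draw it with matplotlib
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(len(tree["leaves"]), labels)
```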


Finish the analysis with **k-means clustering to cluster variables and their categories.** You will use the variables' coordinates on the MCA's axes to plot this. The Dendrogram clusters individuals into groups automatically, whereas k-means requires the number of clusters to be specified before creating the chart. This is why we start with the Dendrogram: it suggests the number of clusters (in the above example, 4).
Using the number you find, create a k-means chart ! Be **descriptive** in your interpretation.

# Go Beyond Plus Ultra !

- Create a Dendrogram for each sampling frame. Check whether the individuals clustered together in the sample fall in the same group as in the previous section on the population.

- Check the same for the variables and categories clustered.

- Does this align with your Sampling & Inference analysis on the best sampling frame ?

- Summarize all findings to your stakeholder (Dashboard)

<img src="https://i.pinimg.com/736x/f0/4a/08/f04a08853d407a93e6a06f1ce10c8173--poetry-inspiration-top-hats.jpg">