# Business Understanding
- An NGO seeks to provide technical support and farm inputs to farmers experiencing extremely low yields in the Karamoja region of Uganda.



# Problem Statement
- The NGO lacks visibility into the overall state of the region and often has to rely on very local sources of information to prioritize its activities.
- Dalberg Data Insights (DDI) has been asked to develop a new food security monitoring tool to support the NGO's decision making.


# Objectives


- Regions with lowest total yield
- Regions with lowest sorghum yield per HA
- Regions with lowest maize yield per HA
- Correlation between population size and yield
- Correlation between crop area and yield

##  Success criteria
* Identify the areas with the lowest yields so that the NGO can prioritize them when distributing resources.

# Data Understanding


In [None]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Loading the data sets
district_crop_yield = pd.read_csv('/content/Uganda_Karamoja_District_Crop_Yield_Population.csv')
district_crop_yield.head()

In [None]:
subcounty_crop_yield = pd.read_csv('/content/Uganda_Karamoja_Subcounty_Crop_Yield_Population.csv')
subcounty_crop_yield.head()

In [None]:
# Checking the structure of the data sets
district_crop_yield.info()

In [None]:
subcounty_crop_yield.info()

In [None]:
# summary of descriptive statistics
district_crop_yield.describe()

In [None]:
subcounty_crop_yield.describe()

In [None]:
# Showing correlation of the numerical columns only and presenting it in a heat map
district_crop_yield.select_dtypes(include='number').corr()

In [None]:
# Set the default figure size before plotting so that it also applies to this heatmap
plt.rcParams['figure.figsize'] = (20, 10)
sns.heatmap(district_crop_yield.select_dtypes(include='number').corr(), annot=True)
plt.show()

In [None]:
subcounty_crop_yield.select_dtypes(include='number').corr()

In [None]:
# Set the default figure size before plotting so that it also applies to this heatmap
plt.rcParams['figure.figsize'] = (20, 10)
sns.heatmap(subcounty_crop_yield.select_dtypes(include='number').corr(), annot=True)
plt.show()

# Data Cleaning


In [None]:
# checking for null values
district_crop_yield.isna().sum()

In [None]:
subcounty_crop_yield.isna().sum()

No null values were found in either data set.

In [None]:
# Checking columns in each dataset
district_crop_yield.columns

In [None]:
subcounty_crop_yield.columns

In [None]:
# creating new fields to show total production per district and sub-county

In [None]:
district_crop_yield['Total_Yield'] = district_crop_yield['M_Prod_Tot'] + district_crop_yield['S_Prod_Tot']
district_crop_yield.head()

In [None]:
subcounty_crop_yield['Total_Yield'] = subcounty_crop_yield['M_Prod_Tot'] + subcounty_crop_yield['S_Prod_Tot']
subcounty_crop_yield.head()

In [None]:
# Removing duplicates

In [None]:
district_crop_yield.shape

In [None]:
district_yield_cleaned = district_crop_yield.drop_duplicates()
district_yield_cleaned.shape

In [None]:
subcounty_crop_yield.shape

In [None]:
subcounty_yield_cleaned = subcounty_crop_yield.drop_duplicates()
subcounty_yield_cleaned.shape

Neither data set contains duplicate rows: the shapes are unchanged after dropping duplicates.
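The shape comparison above can be complemented by counting duplicate rows directly. The following is a minimal sketch on a small made-up frame; `toy` and its values are hypothetical and are not taken from the Karamoja data.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "NAME": ["A", "B", "B"],
    "Total_Yield": [10, 20, 20],
})

# duplicated() flags each row that repeats an earlier row; the sum counts them.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

# drop_duplicates() keeps the first occurrence of each repeated row.
deduped = toy.drop_duplicates()
print(deduped.shape)
```

On the real data, a result of 0 from `district_crop_yield.duplicated().sum()` and `subcounty_crop_yield.duplicated().sum()` would support the same conclusion.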

# EDA - Exploratory data analysis


In [None]:
# Checking regions with lowest yield

In [None]:
# 4 districts with the lowest overall yield
district_yield_cleaned = district_yield_cleaned.sort_values(by='Total_Yield')
print(district_yield_cleaned.head(4)['NAME'])

In [None]:
# Presenting district yield in a bar graph
sns.barplot(data=district_yield_cleaned, x='NAME', y='Total_Yield', palette='viridis')
plt.title('Total Yield per District', fontsize=20)
plt.xlabel('Districts', fontsize=20)
plt.ylabel('Total Yield', fontsize=20)
plt.show()

In [None]:
# 10 subcounties with the least yield
subcounty_yield_cleaned = subcounty_yield_cleaned.sort_values(by='Total_Yield')
print(subcounty_yield_cleaned.head(10)['SUBCOUNTY_NAME'])

In [None]:
# Presenting subcounty yield in a bar graph
sns.barplot(data=subcounty_yield_cleaned, x='SUBCOUNTY_NAME', y='Total_Yield', palette='magma')
plt.title('TOTAL YIELD PER SUB-COUNTY', fontsize=20)
plt.xlabel('Sub-Counties', fontsize=20)
plt.ylabel('Total Yield', fontsize=20)
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
# Lowest sorghum yield district and subcounty wise
district_yield_cleaned = district_yield_cleaned.sort_values(by='S_Yield_Ha')
print(district_yield_cleaned.head(4)['NAME'])

In [None]:
# Presenting district sorghum yield in a bargraph
sns.barplot(data=district_yield_cleaned, x='NAME', y='S_Yield_Ha', palette='Blues')
plt.title('Sorghum Yield per HA District Wise', fontsize=20)
plt.xlabel('Districts', fontsize=20)
plt.ylabel('Sorghum Yield per HA', fontsize=20)
plt.show()

In [None]:
subcounty_yield_cleaned = subcounty_yield_cleaned.sort_values(by='S_Yield_Ha')
print(subcounty_yield_cleaned.head(10)['SUBCOUNTY_NAME'])

In [None]:
# Presenting subcounty sorghum yield in a bar graph
sns.barplot(data=subcounty_yield_cleaned, x='SUBCOUNTY_NAME', y='S_Yield_Ha', palette='Greens')
plt.title('Sorghum Yield per HA Sub-County Wise', fontsize=20)
plt.xlabel('Sub-Counties', fontsize=20)
plt.ylabel('Sorghum Yield per HA', fontsize=20)
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
# Lowest maize yield district and subcounty wise
district_yield_cleaned = district_yield_cleaned.sort_values(by='M_Yield_Ha')
print(district_yield_cleaned.head(4)['NAME'])

4     MOROTO
6      NAPAK
2    KAABONG
0       ABIM
Name: NAME, dtype: object


In [None]:
# Presenting district maize yield in a bargraph
sns.barplot(data=district_yield_cleaned, x='NAME', y='M_Yield_Ha', palette='Oranges')
plt.title('Maize Yield per HA District Wise', fontsize=20)
plt.xlabel('Districts', fontsize=20)
plt.ylabel('Maize Yield per HA', fontsize=20)
plt.show()

In [None]:
subcounty_yield_cleaned = subcounty_yield_cleaned.sort_values(by='M_Yield_Ha')
print(subcounty_yield_cleaned.head(10)['SUBCOUNTY_NAME'])

33    SOUTHERN DIVISION
31                TAPAC
32    NORTHERN DIVISION
45               MATANY
44               LOTOME
47            NGOLERIET
29           KATIKEKILE
30             NADUNGET
9              KALAPATA
6         KAABONG  EAST
Name: SUBCOUNTY_NAME, dtype: object


In [None]:
# Presenting subcounty maize yield in a bar graph
sns.barplot(data=subcounty_yield_cleaned, x='SUBCOUNTY_NAME', y='M_Yield_Ha', palette='Purples')
plt.title('Maize Yield per HA Sub-County Wise', fontsize=20)
plt.xlabel('Sub-Counties', fontsize=20)
plt.ylabel('Maize Yield per HA', fontsize=20)
plt.xticks(rotation=45, ha='right')
plt.show()

In [None]:
# Correlation between population and total production
sns.scatterplot(data=district_yield_cleaned, x='POP', y='Total_Yield')
plt.title('District Population against Total Production', fontsize=20)
plt.xlabel('Population', fontsize=20)
plt.ylabel('Total Yield', fontsize=20)
np.corrcoef(district_yield_cleaned['POP'],district_yield_cleaned['Total_Yield'])[0,1]

In [None]:
sns.scatterplot(data=subcounty_yield_cleaned, x='POP', y='Total_Yield')
plt.title('Sub-County Population against Total Production', fontsize=20)
plt.xlabel('Population', fontsize=20)
plt.ylabel('Total Yield', fontsize=20)
plt.xticks(rotation=45, ha='right')
np.corrcoef(subcounty_yield_cleaned['POP'],subcounty_yield_cleaned['Total_Yield'])[0,1]

In [None]:
# Correlation between crop area and total production
sns.scatterplot(data=district_yield_cleaned, x='Crop_Area_Ha', y='Total_Yield')
plt.title('District Crop Area against Total Production', fontsize=20)
plt.xlabel('Crop Area (HA)', fontsize=20)
plt.ylabel('Total Yield', fontsize=20)
np.corrcoef(district_yield_cleaned['Crop_Area_Ha'],district_yield_cleaned['Total_Yield'])[0,1]

In [None]:
sns.scatterplot(data=subcounty_yield_cleaned, x='Crop_Area_Ha', y='Total_Yield')
plt.title('Sub-County Crop Area against Total Production', fontsize=20)
plt.xlabel('Crop Area (HA)', fontsize=20)
plt.ylabel('Total Yield', fontsize=20)
plt.xticks(rotation=45, ha='right')
np.corrcoef(subcounty_yield_cleaned['Crop_Area_Ha'],subcounty_yield_cleaned['Total_Yield'])[0,1]

## Key findings
- 4 Districts with lowest total yield:

     * MOROTO
     * ABIM
     * AMUDAT
     * NAPAK

- 10 Sub-Counties with lowest total yield:

     * SOUTHERN DIVISION
     * NORTHERN DIVISION
     * AMUDAT TOWN COUNCIL
     * NAKAPIRIPIRIT TOWN COUNCIL
     * KAABONG TOWN COUNCIL
     * KATIKEKILE
     * LODIKO
     * KAABONG  EAST
     * NGOLERIET
     * TAPAC

- 4 Districts with lowest sorghum yield per HA:

    * MOROTO
    * NAPAK
    * AMUDAT
    * KAABONG

- 10 Sub-Counties with lowest sorghum yield per HA:

     * LOPEEI
     * RUPA
     * SOUTHERN DIVISION
     * LOKOPO
     * LOTOME
     * MATANY
     * NORTHERN DIVISION
     * NADUNGET
     * NGOLERIET
     * AMUDAT TOWN COUNCIL

- 4 Districts with lowest maize yield per HA:

     * MOROTO
     * NAPAK
     * KAABONG
     * ABIM

- 10 Sub-Counties with lowest maize yield per HA:

     * SOUTHERN DIVISION
     * TAPAC
     * NORTHERN DIVISION
     * MATANY
     * LOTOME
     * NGOLERIET
     * KATIKEKILE
     * NADUNGET
     * KALAPATA
     * KAABONG  EAST

- Correlation between population size and total yield

There is a weak but positive correlation between population size and total yield.

- Correlation between crop area and total yield

There is a strong and positive correlation between crop area and total yield.
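The "weak" and "strong" labels above rest on the Pearson correlation coefficient. The following is a minimal sketch of how such a coefficient can be computed and labelled. The numbers in `toy` are made up; only the column names `Crop_Area_Ha` and `Total_Yield` come from the notebook. The cut-offs in `describe_strength` are a common rule of thumb and an assumption here, not something the analysis defines.

```python
import pandas as pd

# Hypothetical toy values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "Crop_Area_Ha": [10, 20, 30, 40, 50],
    "Total_Yield": [12, 19, 33, 41, 48],
})


def describe_strength(r):
    """Label |r| with rule-of-thumb cut-offs (an assumption, not a standard)."""
    size = abs(r)
    if size < 0.3:
        return "weak"
    if size < 0.7:
        return "moderate"
    return "strong"


# Series.corr() computes the Pearson coefficient by default.
r_area = toy["Crop_Area_Ha"].corr(toy["Total_Yield"])
print(round(r_area, 2), describe_strength(r_area))
```

Because `Total_Yield` is the sum of the two production totals, a strong link with crop area is expected: more planted area usually means more total output even when yield per hectare is low.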

# Conclusions
1. Regions with Lowest Total Yield

The findings highlight the following districts and sub-counties with the lowest total yield.

  * Districts: Moroto, Abim, Amudat, and Napak.

  * Sub-Counties: Southern Division, Northern Division, Amudat Town Council, Nakapiripirit Town Council, Kaabong Town Council, Katikekile, Lodiko, Kaabong East, Ngoleriet, and Tapac.

2.  Regions with Lowest Sorghum Yield per HA

The findings identify the following districts and sub-counties that are struggling specifically with sorghum production.

  * Districts: Moroto, Napak, Amudat, and Kaabong.

  * Sub-Counties: Lopeei, Rupa, Southern Division, Lokopo, Lotome, Matany, Northern Division, Nadunget, Ngoleriet, and Amudat Town Council.

3.  Regions with Lowest Maize Yield per HA

The findings point to the districts and sub-counties with the lowest maize yields per hectare.

  * Districts: Moroto, Napak, Kaabong, and Abim.

  * Sub-Counties: Southern Division, Tapac, Northern Division, Matany, Lotome, Ngoleriet, Katikekile, Nadunget, Kalapata, and Kaabong East.

4.  There is a weak but positive correlation between population size and total yield: regions with larger populations tend to record somewhat higher total yield.

5. There is a strong positive correlation between crop area and total yield: regions with more area allocated to crops tend to produce more.
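To support prioritization, the three district lists above can be cross-checked to see which districts recur. The following is a minimal sketch that uses only the district names reported in the key findings; the variable names are illustrative.

```python
from collections import Counter

# Bottom-four districts per measure, copied from the key findings.
lowest_total = ["MOROTO", "ABIM", "AMUDAT", "NAPAK"]
lowest_sorghum = ["MOROTO", "NAPAK", "AMUDAT", "KAABONG"]
lowest_maize = ["MOROTO", "NAPAK", "KAABONG", "ABIM"]

# Count how many of the three lists each district appears in.
appearances = Counter(lowest_total + lowest_sorghum + lowest_maize)

# Districts that are in the bottom four on every measure.
in_all_three = sorted(name for name, count in appearances.items() if count == 3)
print(in_all_three)
```

Districts that appear in all three lists are natural candidates for first attention, although this simple count ignores how far apart the underlying values are.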

# Recommendations / Next steps
1. Because the region also suffers from pest and disease outbreaks, their effect on crop production should be explored as well.
2. The impact of the farming practices used across the region on the yields of different crops should also be analyzed.
3. Population is a major factor in farming. It is worth exploring how population density, and not just absolute population, affects productivity.
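As a starting point for the third recommendation, population density could be derived as population divided by land area. The following is a minimal sketch on made-up numbers. The `Area_km2` column is hypothetical: the notebook does not show which area column the data sets contain or what unit it uses, so the name and unit would need to be checked against the real columns.

```python
import pandas as pd

# Hypothetical toy frame; Area_km2 is an assumed column name and unit.
toy = pd.DataFrame({
    "NAME": ["A", "B"],
    "POP": [1000, 3000],
    "Area_km2": [50.0, 100.0],
    "Total_Yield": [200.0, 900.0],
})

# People per square kilometre.
toy["Pop_Density"] = toy["POP"] / toy["Area_km2"]
print(toy[["NAME", "Pop_Density", "Total_Yield"]])
```

The resulting `Pop_Density` column could then be correlated with `Total_Yield` in the same way as `POP` was in the analysis above.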
