We have accumulated 4 datasets:
- World Food Wealth Bank - Crop Data
- World Food Wealth Bank - Live Stock Data
- World Food Wealth Bank - Population
- World GDP


The World Food Wealth Bank dataset was too large to upload to our Project 1 GitHub repository, so we need to narrow the scope of our data.
We decided to focus on the most recent year available in the World Food Wealth Bank data, which is 2020.
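
As a minimal sketch of this narrowing step, the snippet below finds the most recent year and keeps only those rows. The small `sample_df` table is a hypothetical stand-in for the real crop data, and its values are placeholders.

```python
import pandas as pd

# Hypothetical miniature of the crop table (placeholder values, not real data)
sample_df = pd.DataFrame({
    "Area": ["Afghanistan", "Afghanistan", "Albania", "Albania"],
    "Year": [2019, 2020, 2019, 2020],
    "Value": [1.0, 2.0, 3.0, 4.0],
})

# Most recent year present in the data
latest_year = sample_df["Year"].max()

# .copy() returns an independent DataFrame, so later column assignments
# do not act on a slice of the original table
sample_latest = sample_df.loc[sample_df["Year"] == latest_year].copy()
print(latest_year, len(sample_latest))
```

Taking the copy up front is a design choice: it costs some memory but avoids chained-assignment warnings when the filtered table is modified later.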

In [2]:
# Import libraries
import pandas as pd

In [46]:
# import data
crop_df = pd.read_csv("./Foodbank_data/crop1.csv")
live_stock_df = pd.read_csv("./Foodbank_data/live1.csv")
pop_df = pd.read_csv("./Foodbank_data/pop1.csv")


In [47]:
crop_2020 = crop_df.loc[crop_df['Year'] == 2020, :]
crop_2020.head()

Unnamed: 0,Area,Item,Element,Year,Unit,Value
45,Afghanistan,"Almonds, with shell",Area harvested,2020,ha,22134.0
90,Afghanistan,"Almonds, with shell",Yield,2020,hg/ha,17759.0
136,Afghanistan,"Almonds, with shell",Production,2020,tonnes,39307.0
196,Afghanistan,"Anise, badian, fennel, coriander",Area harvested,2020,ha,25759.0
231,Afghanistan,"Anise, badian, fennel, coriander",Yield,2020,hg/ha,7138.0


In [48]:
live_2020 = live_stock_df.loc[live_stock_df['Year']==2020, :]
live_2020.head()

Unnamed: 0,Area,Item,Element,Year,Unit,Value
59,Afghanistan,Asses,Stocks,2020,Head,1535435.0
119,Afghanistan,Camels,Stocks,2020,Head,168928.0
179,Afghanistan,Cattle,Stocks,2020,Head,5085807.0
239,Afghanistan,Chickens,Stocks,2020,1000 Head,13724.0
299,Afghanistan,Goats,Stocks,2020,Head,7967043.0


In [49]:
# Columns
print(f"Crop Columns: {crop_2020.columns}")
print(f"Live Stock Columns: {live_2020.columns}")

Crop Columns: Index(['Area', 'Item', 'Element', 'Year', 'Unit', 'Value'], dtype='object')
Live Stock Columns: Index(['Area', 'Item', 'Element', 'Year', 'Unit', 'Value'], dtype='object')


Crop Columns:
- Area: Country/region
- Item: Crop
- Element: Area harvested, Yield, or Production of the crop
- Year: Only 2020 since that is our focus
- Unit: Unit of measure based on the element value
- Value: The number of units for the crop

Live Stock Columns:
- Area: Country
- Item: Live stock
- Element: Type of measurement (e.g. Stocks)
- Year
- Unit: Unit of measure for the element type
- Value: The value count based on the unit of measure

The column titles are the same as the crop dataset.


In [50]:
# Column Data Types
print(f"Crop data types: {crop_2020.dtypes}")
print(f"Live Stock data types: {live_2020.dtypes}")

Crop data types: Area        object
Item        object
Element     object
Year         int64
Unit        object
Value      float64
dtype: object
Live Stock data types: Area        object
Item        object
Element     object
Year         int64
Unit        object
Value      float64
dtype: object


Data types are correct and matching for both crop and live stock data.

In [51]:
# Missing Data
print(f"Crop value counts: {crop_2020.count()}")
print(f"Live stock value counts: {live_2020.count()}")

Crop value counts: Area       35406
Item       35406
Element    35406
Year       35406
Unit       35406
Value      34461
dtype: int64
Live stock value counts: Area       2046
Item       2046
Element    2046
Year       2046
Unit       2046
Value      2004
dtype: int64
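
From the counts above, the crop data has 35406 - 34461 = 945 missing entries in Value, and the live stock data has 2046 - 2004 = 42. As a small sketch on a hypothetical toy table, `isna().sum()` reports the number of missing entries per column directly, without subtracting from the row count:

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one missing Value
toy = pd.DataFrame({
    "Area": ["A", "B", "C"],
    "Value": [1.0, np.nan, 3.0],
})

total_rows = len(toy)
# count() gives non-null entries per column; isna().sum() gives the missing ones
non_null_per_column = toy.count()
missing_per_column = toy.isna().sum()
print(missing_per_column)
```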


In [23]:
# Value column has only 34461 non-null values out of 35406 rows
# What kind of rows are missing a value?
crop_2020_missing = crop_2020.loc[crop_2020['Value'].isna(), :]
crop_2020_missing

Unnamed: 0,Area,Item,Element,Year,Unit,Value
8704,Albania,"Fruit, stone nes",Area harvested,2020,ha,
8719,Albania,"Fruit, stone nes",Production,2020,tonnes,
9966,Albania,Mushrooms and truffles,Area harvested,2020,ha,
12150,Albania,"Rice, paddy",Area harvested,2020,ha,
12243,Albania,"Rice, paddy",Production,2020,tonnes,
...,...,...,...,...,...,...
1861714,Low Income Food Deficit Countries,Flax fibre and tow,Production,2020,tonnes,
1866042,Low Income Food Deficit Countries,Mushrooms and truffles,Area harvested,2020,ha,
1883875,Net Food Importing Developing Countries,Hemp tow waste,Area harvested,2020,ha,
1883904,Net Food Importing Developing Countries,Hemp tow waste,Production,2020,tonnes,


Do the missing values mean that the country does not produce that crop? If so, the NaN value should be 0.

What is Low Income Food Deficit Countries?
What is Net Food Importing Developing Countries?

These are groups of countries bundled into a single category. The GDP dataset does not include these categories, so they will be removed when we do a left join by Country with the GDP dataset as the left table (a left join that starts from the crop data would keep them).
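
How the aggregate groups fare in the join depends on which table is on the left. The sketch below assumes hypothetical GDP column names (`Country`, `GDP`) and placeholder values, since the GDP dataset is not loaded in this part of the notebook.

```python
import pandas as pd

# Hypothetical miniature tables; column names for GDP and all values are placeholders
crop_toy = pd.DataFrame({
    "Area": ["Albania", "Low Income Food Deficit Countries"],
    "Value": [10.0, 20.0],
})
gdp_toy = pd.DataFrame({"Country": ["Albania"], "GDP": [1.0]})

# GDP on the left: only areas that appear in the GDP table survive
gdp_left = gdp_toy.merge(crop_toy, how="left", left_on="Country", right_on="Area")

# Crop on the left: the aggregate row is kept, with NaN in the GDP columns
crop_left = crop_toy.merge(gdp_toy, how="left", left_on="Area", right_on="Country")

# Inner join: aggregate rows are dropped regardless of table order
inner = crop_toy.merge(gdp_toy, how="inner", left_on="Area", right_on="Country")

print(len(gdp_left), len(crop_left), len(inner))
```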

The value of any missing data can be 0.



In [29]:
# Copy first so the assignment below does not act on a slice of crop_df
crop_2020_no_missing = crop_2020.copy()
# Replace NaN with 0 and keep every reported value unchanged
crop_2020_no_missing['Value'] = crop_2020_no_missing['Value'].fillna(0)
len(crop_2020_no_missing['Value'])


35406

In [31]:
# Check if it worked for all columns
crop_2020_no_missing.count()

Area       35406
Item       35406
Element    35406
Year       35406
Unit       35406
Value      35406
dtype: int64

In [52]:
# Isolate the rows with missing values for live stock data
live_2020_missing = live_2020.loc[live_2020['Value'].isna(), :]
live_2020_missing

Unnamed: 0,Area,Item,Element,Year,Unit,Value
839,Albania,Ducks,Stocks,2020,1000 Head,
899,Albania,Geese and guinea fowls,Stocks,2020,1000 Head,
1079,Albania,Mules,Stocks,2020,Head,
4068,Australia,Buffaloes,Stocks,2020,Head,
5616,Bahamas,Horses,Stocks,2020,Head,
5768,Bahrain,Asses,Stocks,2020,Head,
9442,Bolivia (Plurinational State of),Beehives,Stocks,2020,No,
15596,Central African Republic,Asses,Stocks,2020,Head,
18236,"China, Hong Kong SAR",Geese and guinea fowls,Stocks,2020,1000 Head,
21488,Cook Islands,Ducks,Stocks,2020,1000 Head,


Are there really no bees in the United Kingdom? Are there no pigs in Singapore?

A Google search shows that there are more than 250 species of bees in the UK: https://www.woodlandtrust.org.uk/blog/2019/05/types-of-bee-in-the-uk/

These missing values do not reflect real zeros, so it is safer to remove the rows than to fill them with 0.

In [53]:
# Delete missing data
# dropna must be called; subset limits the check to the Value column
live_2020_no_missing = live_2020.dropna(subset=['Value'])

In [35]:
# Check for duplicate rows
double_crop = crop_2020_no_missing[crop_2020_no_missing.duplicated()]
len(double_crop)


0

No duplicate rows.
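
`duplicated()` with no arguments only flags rows that match in every column. As a hedged sketch on a hypothetical toy table, passing `subset` also catches rows that repeat the same Area/Item/Element/Year key with a different Value:

```python
import pandas as pd

# Hypothetical toy table: the first two rows share a key but differ in Value
toy = pd.DataFrame({
    "Area": ["A", "A", "B"],
    "Item": ["Cattle", "Cattle", "Cattle"],
    "Element": ["Stocks", "Stocks", "Stocks"],
    "Year": [2020, 2020, 2020],
    "Value": [1.0, 2.0, 3.0],
})

# Rows identical in every column
exact_dupes = toy.duplicated().sum()

# Rows that repeat an earlier row's key columns
key_dupes = toy.duplicated(subset=["Area", "Item", "Element", "Year"]).sum()

print(exact_dupes, key_dupes)
```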

In [36]:
# What are the unit measurements?
crop_2020_no_missing['Unit'].unique()

array(['ha', 'hg/ha', 'tonnes'], dtype=object)

Units: 
- ha = hectare (associated with the Element column value Area harvested)
- hg/ha = hectogram/hectare (associated with the Element column value Yield)
- tonnes (associated with the Element column value Production)
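
The three units are related: yield in hg/ha is production divided by area harvested, converted from tonnes to hectograms. The check below uses the Afghanistan "Almonds, with shell" rows shown earlier; the constant name `HG_PER_TONNE` is introduced here only for illustration.

```python
# 1 tonne = 1,000 kg and 1 kg = 10 hg, so 1 tonne = 10,000 hg
HG_PER_TONNE = 10_000

# Values from the Afghanistan "Almonds, with shell" rows for 2020
area_ha = 22134.0
production_tonnes = 39307.0
reported_yield_hg_per_ha = 17759.0

# Yield (hg/ha) = production (tonnes) / area (ha) * 10,000
computed_yield = production_tonnes / area_ha * HG_PER_TONNE
print(round(computed_yield))  # 17759
```

The computed value matches the reported Yield row after rounding, which suggests the three Element rows for a crop are consistent with each other.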

We will be focusing on the Production value. Therefore, we will only be keeping the rows where Element is "Production".

In [38]:
# Filter data to only include Production
crop_2020_production = crop_2020_no_missing.loc[crop_2020_no_missing['Element'] == 'Production', :]
crop_2020_production['Element'].unique()

array(['Production'], dtype=object)

In [39]:
# All the Unit values should be the same
crop_2020_production['Unit'].unique()

array(['tonnes'], dtype=object)

Unit only contains 'tonnes'. We can potentially remove the 'Element' and 'Unit' columns later and rename the 'Value' column to 'Production Value (tonnes)'.
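
A minimal sketch of that later cleanup, using a one-row hypothetical copy of the production table rather than the full data:

```python
import pandas as pd

# One-row stand-in for crop_2020_production (values taken from the rows shown earlier)
production_toy = pd.DataFrame({
    "Area": ["Afghanistan"],
    "Item": ["Almonds, with shell"],
    "Element": ["Production"],
    "Year": [2020],
    "Unit": ["tonnes"],
    "Value": [39307.0],
})

# Drop the two constant columns and fold the unit into the value column name
tidy = (
    production_toy
    .drop(columns=["Element", "Unit"])
    .rename(columns={"Value": "Production Value (tonnes)"})
)
print(list(tidy.columns))
```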

In [13]:
# # Create CSV for 2016 crop data
# crop_2016.to_csv("./data_2016/crop_2016.csv")

### Live Stock Data

In [32]:
live_stock_df = pd.read_csv("./Foodbank_data/live1.csv")

In [33]:
live_stock_df['Year'].max()

2020

In [40]:
live_2020 = live_stock_df.loc[live_stock_df['Year']==2020, :]
live_2020.head()

Unnamed: 0,Area,Item,Element,Year,Unit,Value
59,Afghanistan,Asses,Stocks,2020,Head,1535435.0
119,Afghanistan,Camels,Stocks,2020,Head,168928.0
179,Afghanistan,Cattle,Stocks,2020,Head,5085807.0
239,Afghanistan,Chickens,Stocks,2020,1000 Head,13724.0
299,Afghanistan,Goats,Stocks,2020,Head,7967043.0


In [41]:
live_2020.columns

Index(['Area', 'Item', 'Element', 'Year', 'Unit', 'Value'], dtype='object')

In [None]:
# Look into the data types


In [18]:
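# Taking a closer look at Cattle
# live_2016 is not defined above; assumed here to be the 2016 slice, built like live_2020
live_2016 = live_stock_df.loc[live_stock_df['Year'] == 2016, :]
live_2016_cattle = live_2016.loc[live_2016['Item'] == "Cattle", :]
len(live_2016_cattle['Area'].unique())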


227

In [19]:
double_cattle = live_2016_cattle[live_2016_cattle.duplicated()]
len(double_cattle['Area'].unique())
double_cattle

# No duplicate values for cattle per area

Unnamed: 0,Area,Item,Element,Year,Unit,Value


In [20]:
# Create CSV for 2016 live stock data
live_2016.to_csv("./data_2016/live_2016.csv")

### Population Data

In [21]:
pop_df = pd.read_csv("./Foodbank_data/pop1.csv")

In [22]:
pop_df.head()

Unnamed: 0,Country Name,Country Code,1960,1961,1962,1963,1964,1965,1966,1967,...,2012,2013,2014,2015,2016,2017,2018,2019,2020,2021
0,Aruba,ABW,54208.0,55434.0,56234.0,56699.0,57029.0,57357.0,57702.0,58044.0,...,102565.0,103165.0,103776.0,104339.0,104865.0,105361.0,105846.0,106310.0,106766.0,107195.0
1,Africa Eastern and Southern,AFE,130836765.0,134159786.0,137614644.0,141202036.0,144920186.0,148769974.0,152752671.0,156876454.0,...,547482863.0,562601578.0,578075373.0,593871847.0,609978946.0,626392880.0,643090131.0,660046272.0,677243299.0,694665117.0
2,Afghanistan,AFG,8996967.0,9169406.0,9351442.0,9543200.0,9744772.0,9956318.0,10174840.0,10399936.0,...,31161378.0,32269592.0,33370804.0,34413603.0,35383028.0,36296111.0,37171922.0,38041757.0,38928341.0,39835428.0
3,Africa Western and Central,AFW,96396419.0,98407221.0,100506960.0,102691339.0,104953470.0,107289875.0,109701811.0,112195950.0,...,370243017.0,380437896.0,390882979.0,401586651.0,412551299.0,423769930.0,435229381.0,446911598.0,458803476.0,470898870.0
4,Angola,AGO,5454938.0,5531451.0,5608499.0,5679409.0,5734995.0,5770573.0,5781305.0,5774440.0,...,25107925.0,26015786.0,26941773.0,27884380.0,28842482.0,29816769.0,30809787.0,31825299.0,32866268.0,33933611.0


In [26]:
pop_df.columns

Index(['Country Name', 'Country Code', '1960', '1961', '1962', '1963', '1964',
       '1965', '1966', '1967', '1968', '1969', '1970', '1971', '1972', '1973',
       '1974', '1975', '1976', '1977', '1978', '1979', '1980', '1981', '1982',
       '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991',
       '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000',
       '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009',
       '2010', '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018',
       '2019', '2020', '2021'],
      dtype='object')

In [24]:
pop_2016 = pop_df.loc[:, ['Country Name', 'Country Code', '2016']]
pop_2016.head()

Unnamed: 0,Country Name,Country Code,2016
0,Aruba,ABW,104865.0
1,Africa Eastern and Southern,AFE,609978946.0
2,Afghanistan,AFG,35383028.0
3,Africa Western and Central,AFW,412551299.0
4,Angola,AGO,28842482.0


In [25]:
len(pop_2016['Country Name'])

266

In [27]:
# Export 2016 population data to a CSV file
pop_2016.to_csv("./data_2016/pop_2016.csv")