In [1]:
import pandas as pd

YIELD = '/kaggle/input/worlds-wildlife/yields-of-important-staple-crops new.csv'

df = pd.read_csv(filepath_or_buffer=YIELD, 
                 thousands='.'
                )


# Comparing a float column to the string '0' is always True, so that filter was a no-op
df.head()


          country  code country  Year  Maize yield
0  United Kingdom           GBR  1275          0.0
1  United Kingdom           GBR  1305          0.0
2  United Kingdom           GBR  1355          0.0
3  United Kingdom           GBR  1405          0.0
4  United Kingdom           GBR  1455          0.0


In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13549 entries, 0 to 13548
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   country       13549 non-null  object 
 1   code country  13549 non-null  object 
 2   Year          13549 non-null  int64  
 3   Maize yield   13501 non-null  float64
dtypes: float64(1), int64(1), object(2)
memory usage: 423.5+ KB


In [3]:
df.nunique()

country          241
code country     198
Year             172
Maize yield     9912
dtype: int64

In [4]:
import warnings
from plotly import express

warnings.filterwarnings(action='ignore', category=FutureWarning)
express.line(data_frame=df, x='Year', y='Maize yield', color='code country', height=900)

What a mess. Many of the yields are zero, so let's filter those rows out.

In [5]:
express.line(data_frame=df[df['Maize yield'] > 0], x='Year', y='Maize yield', color='code country', height=900)

Only the US has data prior to 1961, so let's drop the earlier years as well.

In [6]:
express.scatter(data_frame=df[(df['Maize yield'] > 0) & (df['Year'] > 1960)], x='Year', y='Maize yield', color='code country', height=900)
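
The pre-1961 coverage can be checked directly rather than read off the chart. Here is a minimal sketch, assuming a toy frame (`toy`, hypothetical, not the Kaggle data) with the same column names as `df`: it finds each country's first year with a nonzero yield and keeps those that start before 1961.

```python
import pandas as pd

# Hypothetical toy data with the same column names as df (not the real dataset).
toy = pd.DataFrame({
    'country': ['United States', 'United States', 'France', 'France'],
    'Year': [1900, 1961, 1900, 1961],
    'Maize yield': [1.5, 3.9, 0.0, 2.4],
})

# Keep only rows with a real (nonzero) yield.
nonzero = toy[toy['Maize yield'] > 0]

# Earliest year with a nonzero yield, per country.
first_year = nonzero.groupby('country')['Year'].min().sort_values()

# Countries whose nonzero data starts before 1961.
early = first_year[first_year < 1961]
print(early)
```

Running the same steps on `df` instead of `toy` should list only the United States if the observation above holds.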

What are these countries whose codes start with OWID?

In [7]:
df[df['code country'].str.startswith('OWID')]['country'].value_counts()

country
Melanesia                62
Polynesia                62
World                    62
Czechoslovakia           32
Ethiopia (former)        32
USSR                     31
Yugoslavia               31
Serbia and Montenegro    14
Name: count, dtype: int64

Let's take a look at the World data.

In [8]:
express.scatter(data_frame=df[(df['Maize yield'] > 0) & (df['Year'] > 1960) & (df['country'] == 'World')], x='Year', y='Maize yield', color='code country', height=900)

We already filtered out zero yields, yet we apparently have two very different regimes of world yield data. Let's reload the raw data and see if we made a mistake.

In [9]:
raw_df = pd.read_csv(filepath_or_buffer=YIELD)
raw_df[raw_df['country'] == 'World']['Maize yield'].value_counts()

Maize yield
1.9423        1
4.9827        1
4.112         1
38.100.998    1
4.2062        1
             ..
3.6279        1
3.4863        1
3.1001        1
3.6186        1
5.7183        1
Name: count, Length: 62, dtype: int64

Here we apparently have values that differ by three orders of magnitude, which doesn't make sense. It also looks like some values (1.9423, for example) have four digits in what should be a three-digit group.
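
One way to make the separator problem visible is to keep the column as text and test each raw string against the two plausible formats. The sketch below is illustrative only: `sample` is a hypothetical series built from a few of the strings printed above, and the two regular expressions are an assumption about what a dot-grouped integer and a plain decimal look like.

```python
import pandas as pd

# Hypothetical sample: a few of the raw strings from the value_counts output.
sample = pd.Series(['1.9423', '4.112', '38.100.998', '3.6279'], name='Maize yield')

# Dot-grouped integer: 1 to 3 leading digits, then groups of exactly 3 digits.
looks_grouped = sample.str.fullmatch(r'\d{1,3}(\.\d{3})+')

# Plain decimal: digits, a single dot, digits.
looks_decimal = sample.str.fullmatch(r'\d+\.\d+')

report = pd.DataFrame({
    'raw': sample,
    'looks_grouped': looks_grouped,
    'looks_decimal': looks_decimal,
})
print(report)
```

A value such as `4.112` matches both patterns, so the format alone can't say which reading was intended. On the real file, the same check could be run on a column read with `dtype=str`.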

Let's look at the USA data; do we have a good series for it?

In [10]:
express.scatter(data_frame=df[df['country'] == 'United States'], x='Year', y='Maize yield')

No. Unfortunately, this dataset isn't fit for purpose, even for the US series.