In [59]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import sklearn as sk

1. Data Exploration

a. Explore the dataset by displaying the first few rows, summary statistics, and data types of each column.

In [53]:
# Import the crop dataset
df = pd.read_csv('../../data/food_bank/crop1.csv')
# Display first few rows
print(df.head())
# Summary statistics (a bare expression mid-cell is not displayed; wrap it in print() to show it)
df.describe()
# Data types
print(df.dtypes)

          Area                 Item         Element  Year Unit   Value
0  Afghanistan  Almonds, with shell  Area harvested  1975   ha     0.0
1  Afghanistan  Almonds, with shell  Area harvested  1976   ha  5900.0
2  Afghanistan  Almonds, with shell  Area harvested  1977   ha  6000.0
3  Afghanistan  Almonds, with shell  Area harvested  1978   ha  6000.0
4  Afghanistan  Almonds, with shell  Area harvested  1979   ha  6000.0
Area        object
Item        object
Element     object
Year         int64
Unit        object
Value      float64
dtype: object


b. Identify missing values, outliers, and unique values in categorical columns.

In [54]:
# Identify missing values
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)

# Identify unique values in categorical columns
categorical_columns = df.select_dtypes(include=['object']).columns
unique_values = {col: df[col].unique() for col in categorical_columns}

print(f"Unique Areas:{len(unique_values.get('Area'))}")
print(f"Unique Elements:{len(unique_values.get('Element'))}")
print(f"Unique Items:{len(unique_values.get('Item'))}")
print(f"Unique Units:{len(unique_values.get('Unit'))}")

Missing Values:
 Area            0
Item            0
Element         0
Year            0
Unit            0
Value      129500
dtype: int64
Unique Areas:245
Unique Elements:3
Unique Items:118
Unique Units:3


The "Value" column has 129,500 missing entries; every other column is complete.
The gaps often fall on the same crop, so there is little usable data for that crop anyway.
Dropping the rows where "Value" is missing is therefore a reasonable option, since those rows are unlikely to improve the model.
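
A minimal sketch of the drop described above, assuming the column is named "Value" as in the output shown earlier. The small `sample` frame is a hypothetical stand-in with invented rows, not the real dataset. It first checks how the gaps are distributed per crop, then drops only the incomplete rows.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the crop data; the values are illustrative only.
sample = pd.DataFrame({
    "Area": ["Afghanistan", "Afghanistan", "Albania", "Albania"],
    "Item": ["Almonds, with shell", "Almonds, with shell", "Apples", "Apples"],
    "Year": [1975, 1976, 1975, 1976],
    "Value": [0.0, 5900.0, np.nan, np.nan],
})

# Share of missing "Value" entries per crop, to check whether the gaps
# cluster in particular items before discarding anything.
missing_share = sample["Value"].isna().groupby(sample["Item"]).mean()
print(missing_share)

# Drop only the rows whose "Value" is missing; other columns are untouched.
sample_clean = sample.dropna(subset=["Value"]).reset_index(drop=True)
print(len(sample), "->", len(sample_clean))
```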

2. Data Cleaning

In [55]:
# Filter out categorical columns, leaving only numerical ones
df_numerical = df.select_dtypes(include=[np.number])
df_numerical.head()

   Year   Value
0  1975     0.0
1  1976  5900.0
2  1977  6000.0
3  1978  6000.0
4  1979  6000.0


3. Handling Outliers

a. Detect outliers using methods such as the IQR method or Z-score.

In [56]:
# Identify outliers using the IQR method
Q1 = df_numerical.quantile(0.25)
Q3 = df_numerical.quantile(0.75)
IQR = Q3 - Q1
outliers = ((df_numerical < (Q1 - 1.5 * IQR)) | (df_numerical > (Q3 + 1.5 * IQR))).sum()
print("\nOutliers:\n", outliers)


Outliers:
 Year          0
Value    246096
dtype: int64


b. Decide whether to remove, cap, or transform the outliers. Justify your decisions.
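
One way to weigh the three options is to try them on a toy series. The sketch below assumes a right-skewed, non-negative column like "Value"; the numbers are invented, and the 1.5 × IQR fences mirror the detection step above. It is an illustration of the choices, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values standing in for the "Value" column.
values = pd.Series([0.0, 10.0, 12.0, 15.0, 18.0, 20.0, 25.0, 5000.0])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: remove rows outside the IQR fences.
removed = values[(values >= lower) & (values <= upper)]

# Option 2: cap (winsorize) values at the fences, keeping every row.
capped = values.clip(lower=lower, upper=upper)

# Option 3: compress the scale with log(1 + x); valid because values are >= 0.
transformed = np.log1p(values)

print(len(removed), capped.max(), round(transformed.max(), 2))
```

Removal discards rows, capping keeps them but flattens the tail, and the log transform keeps both the rows and their ordering. With 246,096 flagged values, removal would discard a large part of the data, which is worth factoring into the justification.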

4. Data Transformation

a. Encoding Categorical Data

i. Apply label encoding or one-hot encoding to transform categorical data into numerical form.

In [57]:
# Apply one-hot encoding to categorical columns
categorial_df = pd.get_dummies(df, columns=['Area', 'Element', 'Item', 'Unit'])
categorial_df.head()

# Print the number of columns after encoding
print(f"Columns amount: {len(categorial_df.columns)}")

Columns amount: 371


Why do we use one-hot encoding over label encoding?
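
The question can be made concrete with a toy column. In this sketch, which uses a made-up three-row frame, label encoding is approximated with pandas category codes; that is an assumption for illustration, not the notebook's own method.

```python
import pandas as pd

# Hypothetical toy data; the real "Area" column has 245 distinct values.
toy = pd.DataFrame({"Area": ["Albania", "Chile", "Brazil"]})

# Label-style encoding: one integer per category (alphabetical order here).
# The integers imply Albania < Brazil < Chile, an ordering with no real meaning.
label_codes = toy["Area"].astype("category").cat.codes
print(label_codes.tolist())

# One-hot encoding: one indicator column per category, no implied order.
one_hot = pd.get_dummies(toy, columns=["Area"])
print(one_hot.columns.tolist())
```

The trade-off is width: one-hot encoding the four categorical columns here yields 371 columns, as the output above shows.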


b. Feature Scaling

i. Apply feature scaling techniques such as normalization (Min-Max scaling) or standardization (Z-score normalization) to the dataset.

ImportError: cannot import name 'MinMaxScaler' from 'sklearn' (/usr/local/lib/python3.11/site-packages/sklearn/__init__.py)
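
The ImportError above suggests the scaler was imported from the top-level sklearn package; MinMaxScaler and StandardScaler live in sklearn.preprocessing. A minimal sketch, assuming only the numerical columns "Year" and "Value" are scaled, with values copied from the first five rows shown earlier:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Small stand-in frame built from the first five rows of the head() output.
numeric = pd.DataFrame({
    "Year": [1975, 1976, 1977, 1978, 1979],
    "Value": [0.0, 5900.0, 6000.0, 6000.0, 6000.0],
})

# Min-Max scaling maps each column to the range [0, 1].
minmax = pd.DataFrame(
    MinMaxScaler().fit_transform(numeric), columns=numeric.columns
)

# Standardization rescales each column to mean 0 and unit variance.
standard = pd.DataFrame(
    StandardScaler().fit_transform(numeric), columns=numeric.columns
)

print(minmax.round(3))
print(standard.round(3))
```

In a modelling workflow the scaler is normally fit on the training split only and then applied to the test split, so that test-set statistics do not leak into training.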

ii. Explain why feature scaling is necessary and how it impacts the model.
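
One way to see the effect is through a distance computation, which distance-based models such as k-nearest neighbours rely on. The sketch below uses two invented rows and assumed column ranges (the `mins` and `maxs` arrays are hypothetical, not taken from the dataset). Without scaling, the "Value" difference swamps the "Year" difference.

```python
import numpy as np

# Two hypothetical rows as (Year, Value); numbers are illustrative only.
a = np.array([1975.0, 5900.0])
b = np.array([2015.0, 6000.0])

# Raw Euclidean distance: the 100-unit gap in Value dominates the 40-year gap.
raw_distance = np.linalg.norm(a - b)

# Min-Max scale each feature using assumed column ranges (min, max).
mins = np.array([1961.0, 0.0])
maxs = np.array([2021.0, 1_000_000.0])
a_scaled = (a - mins) / (maxs - mins)
b_scaled = (b - mins) / (maxs - mins)

# After scaling, the year gap (40 of 60 years) is the larger contribution.
scaled_distance = np.linalg.norm(a_scaled - b_scaled)

print(round(raw_distance, 1), round(scaled_distance, 3))
```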

5. Data Splitting

a. Split the preprocessed dataset into training and testing sets. Typically, an 80-20 or 70-30 split is used.
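
A minimal sketch of an 80-20 split using train_test_split from sklearn.model_selection. The frame is hypothetical, and treating "Value" as the target is an assumption made for illustration; the notebook does not state which column is predicted.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical preprocessed data: 10 rows, one feature, target "Value".
data = pd.DataFrame({
    "Year": np.arange(1975, 1985),
    "Value": np.linspace(0.0, 9000.0, 10),
})
X = data[["Year"]]
y = data["Value"]

# 80-20 split; random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Changing test_size to 0.3 gives the 70-30 variant mentioned above.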

b. Explain the importance of splitting the data and how it prevents overfitting.