### Data Cleaning

Data cleaning is the process of identifying and correcting (or removing) errors and inconsistencies in data to improve its quality. This is a crucial step in the data preparation phase before analysis or model training. Dirty data can significantly distort the outcomes of any analytical process, leading to inaccurate or misleading conclusions.

#### Common Data Cleaning Tasks
- Handling Missing Values: Filling in missing values, imputing data, or deciding to remove rows or columns with missing data.
- Removing Outliers: Identifying and removing the outliers that are not representative of the data distribution.
- Normalization and Scaling: Converting numerical features to a similar scale.
- Encoding Categorical Values: Transforming categorical variables into a format that can be used by machine learning algorithms, like one-hot encoding.
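
The four tasks above can be sketched on a small table. This is a minimal illustration only: the `toy` DataFrame and its values are hypothetical, and the median fill, the 1.5 × IQR rule, and min-max scaling are each just one of several reasonable choices, not necessarily the ones used later in this notebook.

```python
import pandas as pd

# Hypothetical toy data, not taken from the project files
toy = pd.DataFrame({
    "crop": ["Maize", "Maize", "Rice", "Rice", "Maize", "Rice"],
    "rain_mm": [300.0, None, 320.0, 310.0, 5000.0, 305.0],
})

# 1. Missing values: fill with the column median
toy["rain_mm"] = toy["rain_mm"].fillna(toy["rain_mm"].median())

# 2. Outliers: keep rows within 1.5 * IQR of the quartiles
q1, q3 = toy["rain_mm"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = toy["rain_mm"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = toy[in_range].copy()

# 3. Scaling: min-max scale the numeric column to [0, 1]
lo, hi = cleaned["rain_mm"].min(), cleaned["rain_mm"].max()
cleaned["rain_scaled"] = (cleaned["rain_mm"] - lo) / (hi - lo)

# 4. Encoding: one-hot encode the categorical column
encoded = pd.get_dummies(cleaned, columns=["crop"])
print(encoded)
```

The order matters in this sketch: the missing value is filled before the quartiles are computed, and scaling is applied only after the extreme row has been removed so that it does not compress the remaining values.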

#### Importance of Data Cleaning
- Accuracy: Dirty data can lead to false conclusions and poor decisions.
- Efficiency: Clean data speeds up the data analysis process by reducing the volume of erroneous data that needs to be processed.
- Reliability: Consistent and high-quality data allows for more robust and reliable analytical models and algorithms.
- Compliance: Ensuring that data is clean and high-quality can be essential for meeting legal and governance standards.
- Integrity: Maintaining the integrity of datasets is crucial for any form of statistical or data-driven work.


Data cleaning is a labor-intensive and complex task that often requires domain-specific knowledge. However, it is a critical component of this milestone project, as it directly impacts the quality of the insights. Numerous tools and libraries will be used to assist in automating various aspects of this cleaning process.

In [1]:
import pandas as pd
import numpy as np
import altair as alt
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import Normalize
from IPython.display import display, Markdown

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
pest_df = pd.read_csv('pesticides.csv')
rain_df = pd.read_csv('rainfall.csv')
temp_df = pd.read_csv('temp.csv')
yield_df = pd.read_csv('yield.csv')

In [3]:
# Helper for reading a CSV file into a DataFrame
def read_dataset(file_path):
    try:
        return pd.read_csv(file_path)
    except Exception as e:
        # Report the problem and re-raise so a failed read is not mistaken for data
        print(f'An error occurred: {e}')
        raise

# reading the pesticides dataset
pest_df = read_dataset('pesticides.csv')
pest_df.head()

Unnamed: 0,Domain,Area,Element,Item,Year,Unit,Value
0,Pesticides Use,Albania,Use,Pesticides (total),1990,tonnes of active ingredients,121.0
1,Pesticides Use,Albania,Use,Pesticides (total),1991,tonnes of active ingredients,121.0
2,Pesticides Use,Albania,Use,Pesticides (total),1992,tonnes of active ingredients,121.0
3,Pesticides Use,Albania,Use,Pesticides (total),1993,tonnes of active ingredients,121.0
4,Pesticides Use,Albania,Use,Pesticides (total),1994,tonnes of active ingredients,201.0


# **Observations**


#### **Pesticides file**
- The dataset contains information about pesticide use across different years and areas (presumably countries).
#### Attributes:
 - Domain: Appears to indicate the subject area of the data, which is 'Pesticides Use' in this case.
 - Area: The geographical area, which looks like the name of the country.
 - Element: Specifies what the data represents, i.e., 'Use' of pesticides.
 - Item: The type of pesticide used. It is listed as 'Pesticides (total)' in the initial rows.
 - Year: The year in which the data was collected.
 - Unit: The unit of measurement, which is 'tonnes of active ingredients'.
 - Value: The actual value, representing the amount of pesticides used.

In [4]:
# reading the rainfall dataset
rain_df = read_dataset('rainfall.csv')
rain_df.head()

Unnamed: 0,Area,Year,average_rain_fall_mm_per_year
0,Afghanistan,1985,327
1,Afghanistan,1986,327
2,Afghanistan,1987,327
3,Afghanistan,1989,327
4,Afghanistan,1990,327


#### **Rain file**
 - Area: The geographical area, most likely a country. It has 0 missing values. The raw column name carries a leading space (' Area').
 - Year: The year of the data point. It has 0 missing values.
 - average_rain_fall_mm_per_year: The average rainfall in mm per year for the given area and year. It has 774 missing values.
 - Data type inconsistencies can hinder numerical analyses and visualizations. The average_rain_fall_mm_per_year column is stored as strings although it should be numeric, so it needs conversion to float or integer before meaningful statistical operations.
 - The broad scope in terms of years and areas provides a rich dataset but also introduces complexity when combining with other datasets. Data alignment and aggregation may be needed.
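
Before converting the rainfall column, it can help to see which entries fail to parse. The following is a minimal sketch on a hypothetical sample: the `".."` token is an assumption standing in for whatever non-numeric text the real file contains, and `sample` is not the project data.

```python
import pandas as pd

# Hypothetical sample; ".." is an assumed placeholder token, not confirmed from rainfall.csv
sample = pd.DataFrame({
    " Area": ["A", "A", "B", "B"],
    "average_rain_fall_mm_per_year": ["327", "..", "1485", None],
})

# Strip stray whitespace from column names (the raw file has ' Area')
sample.columns = sample.columns.str.strip()

raw = sample["average_rain_fall_mm_per_year"]
as_number = pd.to_numeric(raw, errors="coerce")  # unparseable entries become NaN

# Entries that were present in the file but could not be read as numbers
bad_tokens = raw[as_number.isna() & raw.notna()].unique()
print(list(bad_tokens))

sample["average_rain_fall_mm_per_year"] = as_number
print(sample.dtypes)
```

Listing the failing tokens first makes it possible to tell placeholder text apart from genuinely absent values before both are turned into `NaN`.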

In [5]:
# reading the temperature dataset
temp_df = read_dataset('temp.csv')
temp_df.head()

Unnamed: 0,year,country,avg_temp
0,1849,Côte D'Ivoire,25.58
1,1850,Côte D'Ivoire,25.52
2,1851,Côte D'Ivoire,25.67
3,1852,Côte D'Ivoire,
4,1853,Côte D'Ivoire,


### **Temp.csv file**
- The dataset contains information about average annual temperature across different years and countries.
##### Attributes:
 - year: The year in which the data was collected. It has 0 missing values.
 - country: The name of the country. It has 0 missing values.
 - avg_temp: The average annual temperature, presumably in degrees Celsius (the unit is not explicitly stated). It has 2,547 missing values.

In [6]:
# reading the yield dataset
yield_df = read_dataset('yield.csv')
yield_df.head()

Unnamed: 0,Domain Code,Domain,Area Code,Area,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Value
0,QC,Crops,2,Afghanistan,5419,Yield,56,Maize,1961,1961,hg/ha,14000
1,QC,Crops,2,Afghanistan,5419,Yield,56,Maize,1962,1962,hg/ha,14000
2,QC,Crops,2,Afghanistan,5419,Yield,56,Maize,1963,1963,hg/ha,14260
3,QC,Crops,2,Afghanistan,5419,Yield,56,Maize,1964,1964,hg/ha,14257
4,QC,Crops,2,Afghanistan,5419,Yield,56,Maize,1965,1965,hg/ha,14400


#### **Yield file**
- The dataset contains 56,717 records and 12 columns.
##### Attributes
 - Domain Code, Domain, Element Code, Element, and Unit: These columns contain only 1 unique value and are likely to be redundant for analysis.
 - Area Code and Area: Represent the geographical location with 212 unique areas.
 - Item Code and Item: Indicate the type of crop, with 10 unique crop types.
 - Year Code and Year: Represent the year data was collected, with 56 unique years.
 - Value: Represents crop yield and contains 36,815 unique values.

#### **Data Types**
- The data types are mixed: integer types (Area Code, Element Code, Item Code, Year Code, Year, Value) and object types (Domain Code, Domain, Area, Element, Item, Unit).
#### **Considerations for Analysis**
- Redundant columns in each file, including Unit, may be dropped during data preprocessing to streamline the dataset.
- Some columns contain string or object types and may need conversion to numerical types for analysis.
- The datasets seem to have multiple columns that provide similar or identical information in different formats (e.g., "Domain Code" and "Domain", "Area Code" and "Area", "Element Code" and "Element", "Year Code" and "Year"). These may need to be streamlined during the data preprocessing stage.
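
One way to confirm which columns are redundant is to test for constant columns and for code/name pairs that map one-to-one. This is a sketch under assumptions: `mini` is a hypothetical miniature of the yield table, and `is_one_to_one` is an illustrative helper, not part of the notebook.

```python
import pandas as pd

# Hypothetical miniature of the yield table
mini = pd.DataFrame({
    "Domain": ["Crops"] * 4,
    "Area Code": [2, 2, 3, 3],
    "Area": ["Afghanistan", "Afghanistan", "Albania", "Albania"],
    "Year Code": [1961, 1962, 1961, 1962],
    "Year": [1961, 1962, 1961, 1962],
    "Value": [14000, 14000, 16000, 16500],
})

# Columns with a single unique value carry no information for the analysis
constant_cols = [c for c in mini.columns if mini[c].nunique(dropna=False) == 1]

def is_one_to_one(df, a, b):
    """Return True when each value of column a pairs with exactly one value of b, and vice versa."""
    pairs = df[[a, b]].drop_duplicates()
    return pairs[a].is_unique and pairs[b].is_unique

print(constant_cols)
print(is_one_to_one(mini, "Area Code", "Area"))  # code and name describe the same thing
print(mini["Year Code"].equals(mini["Year"]))    # identical columns
```

Matching unique counts alone (for example 212 area codes and 212 areas) suggest a one-to-one mapping but do not prove it, which is why the pairwise check is useful before dropping a column.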

In [7]:
# Define a function to count the number of unique values in each column of the dataset
def count_unique(df):
    """
    This function takes the DataFrame as an argument.
    It returns a dictionary containing the count of unique values for each column.
    
    Parameters:
    df (DataFrame): The DataFrame to analyze (DataFrame)
    
    Returns:
    unique_count (dict): A dictionary containing the count of unique values for each column
    """
    try:
        unique_count = df.nunique().to_dict()
        return unique_count
    except Exception as e:
        return f"An error occurred: {e}"

# Count the number of unique values in the Pest DataFrame
unique_pest = count_unique(pest_df)
unique_pest

{'Domain': 1,
 'Area': 168,
 'Element': 1,
 'Item': 1,
 'Year': 27,
 'Unit': 1,
 'Value': 2825}

#### **Findings**
Number of unique values in the Pesticides dataset (pest_df):
- Domain: 1 unique value
- Area: 168 unique areas or countries
- Element: 1 unique value
- Item: 1 unique item (likely representing a specific type of pesticide or aggregate measure)
- Year: 27 unique years
- Unit: 1 unique unit
- Value: 2,825 unique pesticide usage values

The Pesticides dataset contains data from 168 different areas or countries and spans 27 unique years. It seems to focus on a specific type or aggregate measure of pesticide, as indicated by the single unique item. The dataset has 2,825 unique pesticide usage values, suggesting a range of pesticide practices across different regions and timeframes.

In [8]:
# Count the number of unique values in the Rain DataFrame
unique_rain = count_unique(rain_df)
unique_rain

{' Area': 217, 'Year': 31, 'average_rain_fall_mm_per_year': 173}

#### **Findings**
Number of unique values in the Rainfall dataset (rain_df):
- Area: 217 unique areas or countries
- Year: 31 unique years
- average_rain_fall_mm_per_year: 173 unique average rainfall values per year

The Rainfall dataset contains data from 217 different areas or countries and spans 31 unique years. There are 173 unique values for average rainfall, indicating a variety of rainfall measurements across different regions and time periods.

In [9]:
# Count the number of unique values in the Temperature DataFrame
unique_temp = count_unique(temp_df)
unique_temp

{'year': 271, 'country': 137, 'avg_temp': 3303}

#### **Findings**
Number of unique values in the Temperature dataset (temp_df):
- year: 271 unique years
- country: 137 unique countries
- avg_temp: 3,303 unique average temperature values

The dataset spans 271 unique years and covers 137 different countries. The average temperature (avg_temp) has 3,303 unique values, which suggests a wide range of temperature data points across countries and years.

In [10]:
# Count the number of unique values in the Yield DataFrame
unique_yield = count_unique(yield_df)
unique_yield

{'Domain Code': 1,
 'Domain': 1,
 'Area Code': 212,
 'Area': 212,
 'Element Code': 1,
 'Element': 1,
 'Item Code': 10,
 'Item': 10,
 'Year Code': 56,
 'Year': 56,
 'Unit': 1,
 'Value': 36815}

#### **Findings**
Number of unique values in the Yield dataset (yield_df):
- Domain Code: 1 unique value
- Domain: 1 unique value
- Area Code: 212 unique area codes
- Area: 212 unique areas or countries
- Element Code: 1 unique value
- Element: 1 unique value
- Item Code: 10 unique item codes (likely representing different crops)
- Item: 10 unique items (likely representing different crops)
- Year Code: 56 unique year codes
- Year: 56 unique years
- Unit: 1 unique unit
- Value: 36,815 unique yield values

The Yield dataset contains data from 212 different areas or countries and spans 56 unique years. There are 10 unique crops represented. The yield values are quite diverse, with 36,815 unique measurements.
Many columns have only one unique value, indicating that they might not be very informative for the analysis.

### Renaming some columns

In [11]:
# Define a function to rename columns in the Pesticides DataFrame
def rename_columns(df):
    """
    This function takes the Pesticides DataFrame as an argument and renames specific columns.
    
    Parameters:
    df (DataFrame): The DataFrame to rename columns ( DataFrame)
    
    Returns:
    df (DataFrame): The DataFrame with renamed columns
    """
    try:
        df.rename(columns={'Area': 'Country', 'Value': 'Pesticides'}, inplace=True)
        return df
    except Exception as e:
        return f"An error occurred: {e}"

# Test the function to rename columns for the pesticides DataFrame
pest_df = rename_columns(pest_df)
pest_df.head()

Unnamed: 0,Domain,Country,Element,Item,Year,Unit,Pesticides
0,Pesticides Use,Albania,Use,Pesticides (total),1990,tonnes of active ingredients,121.0
1,Pesticides Use,Albania,Use,Pesticides (total),1991,tonnes of active ingredients,121.0
2,Pesticides Use,Albania,Use,Pesticides (total),1992,tonnes of active ingredients,121.0
3,Pesticides Use,Albania,Use,Pesticides (total),1993,tonnes of active ingredients,121.0
4,Pesticides Use,Albania,Use,Pesticides (total),1994,tonnes of active ingredients,201.0


In [12]:
# Rename the Rainfall columns directly; note the leading space in ' Area'.
# rename_columns() only maps the Pesticides column names, so it is not needed here.
rain_df.rename(columns={' Area': 'Country', 'average_rain_fall_mm_per_year': 'Average_Rainfall'}, inplace=True)
rain_df.head()

Unnamed: 0,Country,Year,Average_Rainfall
0,Afghanistan,1985,327
1,Afghanistan,1986,327
2,Afghanistan,1987,327
3,Afghanistan,1989,327
4,Afghanistan,1990,327


In [13]:
# Rename the Temperature columns directly.
# rename_columns() only maps the Pesticides column names, so it is not needed here.
temp_df.rename(columns={'year': 'Year', 'country': 'Country', 'avg_temp': 'Average_Temperature'}, inplace=True)
temp_df.head()

Unnamed: 0,Year,Country,Average_Temperature
0,1849,Côte D'Ivoire,25.58
1,1850,Côte D'Ivoire,25.52
2,1851,Côte D'Ivoire,25.67
3,1852,Côte D'Ivoire,
4,1853,Côte D'Ivoire,


In [14]:
# Rename the Yield columns directly.
# rename_columns() would map 'Value' to 'Pesticides', so it is not used here.
yield_df.rename(columns={'Area': 'Country', 'Value': 'Yield'}, inplace=True)
yield_df.head()

Unnamed: 0,Domain Code,Domain,Area Code,Country,Element Code,Element,Item Code,Item,Year Code,Year,Unit,Yield
0,QC,Crops,2,Afghanistan,5419,Yield,56,Maize,1961,1961,hg/ha,14000
1,QC,Crops,2,Afghanistan,5419,Yield,56,Maize,1962,1962,hg/ha,14000
2,QC,Crops,2,Afghanistan,5419,Yield,56,Maize,1963,1963,hg/ha,14260
3,QC,Crops,2,Afghanistan,5419,Yield,56,Maize,1964,1964,hg/ha,14257
4,QC,Crops,2,Afghanistan,5419,Yield,56,Maize,1965,1965,hg/ha,14400


### Checking for duplicate values

In [15]:
# Define a function to check for duplicate rows in the DataFrame
def check_duplicates(df):
    """
    This function takes a DataFrame as an argument.
    It checks for duplicate rows and returns the number of duplicate rows found.
    
    Parameters:
    df (DataFrame): The DataFrame to check for duplicates
    
    Returns:
    num_duplicates (int): The number of duplicate rows found
    """
    try:
        num_duplicates = df.duplicated().sum()
        return num_duplicates
    except Exception as e:
        return f"An error occurred: {e}"

# Check for duplicate rows in the Pesticides DataFrame
num_duplicates_pest = check_duplicates(pest_df)
num_duplicates_pest

0

In [16]:
# Check for duplicate rows in the rain DataFrame
num_duplicates_rain = check_duplicates(rain_df)
num_duplicates_rain

0

In [17]:
# Check for duplicate rows in the Temperature DataFrame
num_duplicates_temp = check_duplicates(temp_df)
num_duplicates_temp

6958

The temperature dataset contains 6,958 duplicate rows. Depending on the type of analysis we're conducting, these duplicates could skew our results, for example by multiplying rows when the table is merged with others on Country and Year. We may need to investigate why these duplicates exist and decide whether to keep them, remove them, or take some other action.
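
A possible way to investigate the duplicates is sketched below. The `sample` rows are hypothetical, and averaging conflicting readings per (Country, Year) is only one option and an assumption here, not a step this notebook performs.

```python
import pandas as pd

# Hypothetical sample using the renamed Temperature columns
sample = pd.DataFrame({
    "Year": [1990, 1990, 1990, 1991],
    "Country": ["Ghana", "Ghana", "Ghana", "Ghana"],
    "Average_Temperature": [26.73, 26.73, 27.10, 26.90],
})

# keep=False marks every member of a duplicate group, which makes inspection easier
exact_dupes = sample[sample.duplicated(keep=False)]

# Exact copies carry no extra information and can be dropped
deduped = sample.drop_duplicates()

# Rows that share (Country, Year) but disagree on the temperature need a decision
conflicting = deduped[deduped.duplicated(subset=["Country", "Year"], keep=False)]

# One option: collapse to a single value per (Country, Year) by averaging
one_per_key = (
    deduped.groupby(["Country", "Year"], as_index=False)["Average_Temperature"].mean()
)
print(one_per_key)
```

Separating exact copies from rows that only share the key shows whether the duplicates are harmless repeats or several distinct readings for the same country and year.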

In [18]:
# Check for duplicate rows in the Yield DataFrame
num_duplicates_yield = check_duplicates(yield_df)
num_duplicates_yield

0

### Data Integrity Check

In [19]:
# checking for missing values
def print_missing_values(df, name):
    print(f"Missing values in {name}:")
    print('_____________________________')
    print(df.isna().sum())
    print('\n')

print_missing_values(temp_df, 'temp_df')
print_missing_values(rain_df, 'rain_df')
print_missing_values(pest_df, 'pest_df')
print_missing_values(yield_df, 'yield_df')

Missing values in temp_df:
_____________________________
Year                      0
Country                   0
Average_Temperature    2547
dtype: int64


Missing values in rain_df:
_____________________________
Country               0
Year                  0
Average_Rainfall    774
dtype: int64


Missing values in pest_df:
_____________________________
Domain        0
Country       0
Element       0
Item          0
Year          0
Unit          0
Pesticides    0
dtype: int64


Missing values in yield_df:
_____________________________
Domain Code     0
Domain          0
Area Code       0
Country         0
Element Code    0
Element         0
Item Code       0
Item            0
Year Code       0
Year            0
Unit            0
Yield           0
dtype: int64




## **Findings**
#### Rain file
1. Missing Values: The column average_rain_fall_mm_per_year has 774 missing values.
2. Duplicates: There are no duplicate rows in the dataset.
3. Data Types:
   - Area is of object (string) type, which is expected.
   - Year is an integer, which is also expected.
   - average_rain_fall_mm_per_year is of object type, which is not expected for a numerical feature. This suggests that some non-numeric characters might be present in this column.

#### Temp file
A total of 2,547 values are missing in avg_temp, and this will need to be addressed. The right strategy depends on the nature of the missingness: whether the values are missing completely at random, missing at random, or missing not at random.
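
A first, informal look at the missingness pattern is to compare the share of missing values across groups. The sketch below uses a hypothetical `sample`; uneven shares across countries or years would hint that the values are not missing completely at random, although this check alone cannot settle the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in the raw Temperature layout
sample = pd.DataFrame({
    "year": [1850, 1851, 1990, 1991, 1850, 1851, 1990, 1991],
    "country": ["A"] * 4 + ["B"] * 4,
    "avg_temp": [np.nan, np.nan, 25.1, 25.3, 10.2, 10.4, 10.9, 11.0],
})

is_missing = sample["avg_temp"].isna()

# Share of missing values per country and per year
missing_by_country = is_missing.groupby(sample["country"]).mean()
missing_by_year = is_missing.groupby(sample["year"]).mean()

print(missing_by_country)
print(missing_by_year)
```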

#### **Handling Missing Values**
Handling missing values is crucial for ensuring that the data is complete and suitable for analysis or machine learning, which in turn makes the outcomes more reliable. It also improves data quality and avoids gaps in visualizations. Adequately handling the missing values in this milestone paves the way for more accurate and insightful data analysis.

In [20]:
# Handle missing values in the Rainfall DataFrame using mean
def handle_missing_values(df):
    """
    This function takes the Rainfall DataFrame as an argument.
    It replaces missing values in numerical columns with their mean.
    
    Parameters:
    df (DataFrame): The DataFrame to handle missing values (Rainfall DataFrame)
    
    Returns:
    df (DataFrame): The DataFrame with missing values replaced by mean
    """
    try:
        # Convert the 'Average_Rainfall' column to numeric, coercing errors to NaN for replacement
        df['Average_Rainfall'] = pd.to_numeric(df['Average_Rainfall'], errors='coerce')
        
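        # Assign the filled column back: chained fillna(inplace=True) is not guaranteed
        # to modify df under pandas copy-on-write
        for col in df.select_dtypes(include=['float64', 'int64']).columns:
            df[col] = df[col].fillna(df[col].mean())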
        return df
    except Exception as e:
        return f"An error occurred: {e}"

# Handle missing values in the Rainfall DataFrame
rainfall_df = handle_missing_values(rain_df)
rainfall_df.isnull().sum()


Country             0
Year                0
Average_Rainfall    0
dtype: int64

In [21]:
def handle_missing_values_temp(df):
    """
    This function takes the Temperature DataFrame as an argument.
    It replaces missing values in numerical columns with their mean.
    
    Parameters:
    df (DataFrame): The DataFrame to handle missing values (Temperature DataFrame)
    
    Returns:
    df (DataFrame): The DataFrame with missing values replaced by mean
    """
    try:
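        # Assign the filled column back: chained fillna(inplace=True) is not guaranteed
        # to modify df under pandas copy-on-write
        for col in df.select_dtypes(include=['float64', 'int64']).columns:
            df[col] = df[col].fillna(df[col].mean())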
        return df
    except Exception as e:
        print(f"An error occurred: {e}")
        return df  # Returning the original DataFrame

# Now, run the function on the dataset
handled_temp_df = handle_missing_values_temp(temp_df)

# Check for missing values after handling them
handled_temp_df.isnull().sum()


Year                   0
Country                0
Average_Temperature    0
dtype: int64

### Purpose of Using the Mean:

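Using the mean to impute missing values is a common and simple technique in data analysis. Replacing missing values with the column mean leaves that column's mean unchanged, so averages computed afterwards are not shifted. It does, however, reduce the column's variance and can weaken its correlation with other variables, because every imputed row receives the same value. The mean is also sensitive to extreme values, so the median is a common alternative for skewed data. Here the mean is taken over all countries and years, which is a coarse estimate for any single country, and the imputed values should be read with that limitation in mind.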
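
The effect of mean imputation on the mean and the variance can be checked directly. This is a minimal sketch on hypothetical values; the per-country fill shown at the end is an alternative offered for comparison and is not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: two countries with very different climates
sample = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "B"],
    "Average_Temperature": [25.0, np.nan, 27.0, 9.0, 11.0, np.nan],
})
col = "Average_Temperature"

# Global mean imputation, as done in this notebook
global_filled = sample[col].fillna(sample[col].mean())

# Alternative (assumption): fill with the mean of the same country
group_filled = sample[col].fillna(sample.groupby("Country")[col].transform("mean"))

print("mean before/after:", sample[col].mean(), global_filled.mean())
print("variance before/after:", sample[col].var(), global_filled.var())
print(group_filled.tolist())
```

In this sample the global fill gives both countries the same imputed temperature, while the per-country fill keeps each imputed value close to that country's observed readings.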

### Removing Redundant Columns

Redundant columns are those that do not add any informational value to the dataset. These columns will be removed to streamline the dataset and make subsequent analyses more efficient.

In [22]:
# Dropping redundant columns from the Yield DataFrame
def drop_redundant_columns(df):
    """
    This function takes the Yield DataFrame as an argument.
    It drops redundant or unnecessary columns.
    
    Parameters:
    df (DataFrame): The DataFrame from which to drop columns (Yield DataFrame)
    
    Returns:
    df (DataFrame): The DataFrame with redundant columns dropped
    """
    try:
        # Drop columns that are not useful for our analysis
        columns_to_drop = ['Domain Code', 'Domain', 'Area Code', 'Element Code', 'Element', 'Item Code', 'Year Code', 'Unit']
        df.drop(columns=columns_to_drop, inplace=True, errors='ignore')
        return df
    except Exception as e:
        return f"An error occurred: {e}"
    
# Test one of the functions to drop redundant columns (for the Yield DataFrame)  
yield_df = drop_redundant_columns(yield_df)
yield_df.head()

Unnamed: 0,Country,Item,Year,Yield
0,Afghanistan,Maize,1961,14000
1,Afghanistan,Maize,1962,14000
2,Afghanistan,Maize,1963,14260
3,Afghanistan,Maize,1964,14257
4,Afghanistan,Maize,1965,14400


In [23]:
# Drop the single-valued columns from the Pesticides DataFrame.
# drop_redundant_columns() lists the Yield columns, so the drop is done directly here.
columns_to_drop = ['Domain', 'Element', 'Item', 'Unit']
pest_df.drop(columns=columns_to_drop, inplace=True, errors='ignore')
pest_df.head()

Unnamed: 0,Country,Year,Pesticides
0,Albania,1990,121.0
1,Albania,1991,121.0
2,Albania,1992,121.0
3,Albania,1993,121.0
4,Albania,1994,201.0


In [28]:
def merge_datasets(temp_df, rain_df, yield_df, pest_df):
    """
    This function takes the four DataFrames as arguments.
    It merges them based on the common columns 'Country' and 'Year'.
    
    Returns:
    DataFrame: The DataFrame resulting from merging the four DataFrames
    """
    try:
        # Merge Temperature and Rainfall DataFrames
        merged_1 = pd.merge(temp_df, rain_df, how='inner', on=['Country', 'Year'])
        print("Successfully merged Temperature and Rainfall DataFrames.")
        
        # Merge the above DataFrame (merged_1) with the Yield DataFrame
        merged_2 = pd.merge(merged_1, yield_df, how='inner', on=['Country', 'Year'])
        print("Successfully merged with Yield DataFrame.")
        
        # Merge the above DataFrame (merged_2) with the Pesticides DataFrame
        temporal_merged_df = pd.merge(merged_2, pest_df, how='inner', on=['Country', 'Year'])
        print("Successfully merged with Pesticides DataFrame.")
        
        return temporal_merged_df
    
    except Exception as e:
        print(f"An error occurred while merging: {e}")
        return None

# Merge the four cleaned DataFrames on 'Country' and 'Year'
temporal_merged_df = merge_datasets(temp_df, rain_df, yield_df, pest_df)

# Preview the result; merge_datasets returns None if the merge failed
if temporal_merged_df is not None:
    print(temporal_merged_df.head())

Successfully merged Temperature and Rainfall DataFrames.
Successfully merged with Yield DataFrame.
Successfully merged with Pesticides DataFrame.
   Year Country  Average_Temperature  Average_Rainfall                  Item  \
0  1990   Ghana                26.73            1187.0               Cassava   
1  1990   Ghana                26.73            1187.0                 Maize   
2  1990   Ghana                26.73            1187.0  Plantains and others   
3  1990   Ghana                26.73            1187.0           Rice, paddy   
4  1990   Ghana                26.73            1187.0               Sorghum   

   Yield  Pesticides  
0  84170        65.8  
1  11889        65.8  
2  61890        65.8  
3  16510        65.8  
4   6310        65.8  


In [29]:
temporal_merged_df.head()

Unnamed: 0,Year,Country,Average_Temperature,Average_Rainfall,Item,Yield,Pesticides
0,1990,Ghana,26.73,1187.0,Cassava,84170,65.8
1,1990,Ghana,26.73,1187.0,Maize,11889,65.8
2,1990,Ghana,26.73,1187.0,Plantains and others,61890,65.8
3,1990,Ghana,26.73,1187.0,"Rice, paddy",16510,65.8
4,1990,Ghana,26.73,1187.0,Sorghum,6310,65.8
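
Because each inner merge silently drops keys that are missing from either side, it can be useful to check how many (Country, Year) pairs match before trusting the merged table. The sketch below uses the `indicator` and `validate` parameters of `pd.merge` on hypothetical `temp` and `rain` tables; apart from the Ghana 1990 row, the values are invented for illustration.

```python
import pandas as pd

# Hypothetical miniature tables; only the Ghana 1990 values come from the output above
temp = pd.DataFrame({
    "Country": ["Ghana", "Ghana", "Kenya"],
    "Year": [1990, 1991, 1990],
    "Average_Temperature": [26.73, 26.90, 24.10],
})
rain = pd.DataFrame({
    "Country": ["Ghana", "Ghana", "Chad"],
    "Year": [1990, 1991, 1990],
    "Average_Rainfall": [1187.0, 1187.0, 322.0],
})

# An outer merge with indicator=True labels where each key was found
check = pd.merge(temp, rain, how="outer", on=["Country", "Year"], indicator=True)
counts = check["_merge"].value_counts()
print(counts)

# validate raises pandas.errors.MergeError if (Country, Year) repeats on either side
merged = pd.merge(temp, rain, how="inner", on=["Country", "Year"], validate="one_to_one")
print(len(merged))
```

Keys labelled `left_only` or `right_only` often point to country names spelled differently across sources, and a failing `validate` check points to duplicate keys that would otherwise multiply rows in the result.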


In [31]:
# Saving the merged DataFrame to a CSV file
csv_file_path = 'temporal_merged_data.csv'
temporal_merged_df.to_csv(csv_file_path, index=False)

csv_file_path

'temporal_merged_data.csv'