<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/Python-Notebook-Banners/Code_challenge.png"  style="display: block; margin-left: auto; margin-right: auto;"/>
</div>

# Integrated project: Validating our data
© ExploreAI Academy

In this Code Challenge we’re diving into the agricultural dataset again to continue to validate our data. Before we do that, we’re pausing to build a data pipeline that will ingest and clean our data with the press of a button, cleaning up our code significantly. Once that’s ready, we’ll complete our data validation.

# Introduction

Here we are diving right into the code without outlining the step by step process.

So what's the plan? 
1. Create a null hypothesis.
1. Import the `MD_agric_df` dataset and clean it up.
1. Import the weather data.
1. Map the weather data to the field data.
1. Calculate the means of the weather station dataset and the means of the main dataset.
2. Calculate all the parameters we need to do a t-test. 
3. Interpret our results.

# Validating the dataset

So we finally have working modules that automatically pull data from the database (or the web), process it, clean it, and return our starting DataFrame. Before we jump in and analyse the data, let's pause for a second and ask: Did the changes actually get applied? Did we correct the elevation data? Did we rename the columns? We could go back to the old ways and create queries to check, but a better way is to **test our dataset**. 

Let's get the data in first. Remember to use your `config_params` dictionary.

**Imports**

In [1]:
import re
import logging 
import numpy as np
import pandas as pd

from scripts.field_data_processor import FieldDataProcessor
from scripts.weather_data_processor import WeatherDataProcessor

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

**Parameters**

In [2]:
weather_station_df = pd.read_csv("https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Maji_Ndogo/Weather_station_data.csv")
weather_station_mapping_df = pd.read_csv("https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Maji_Ndogo/Weather_data_field_mapping.csv")

config_params = {
    "sql_query": """SELECT *
        FROM geographic_features
        LEFT JOIN weather_features USING (Field_ID)
        LEFT JOIN soil_and_crop_features USING (Field_ID)
        LEFT JOIN farm_management_features USING (Field_ID)
        """, 
    "db_path": 'sqlite:///database/Maji_Ndogo_farm_survey_small.db', # Insert the db_path of the database
    "columns_to_rename": {'Annual_yield': 'Crop_type', 'Crop_type': 'Annual_yield'}, # Insert the dictionary of columns we want to swop the names of
    "values_to_rename": {'cassaval': 'cassava', 'wheatn': 'wheat', 'teaa': 'tea'}, # Insert the crop type renaming dictionary
    "weather_csv_path": 'https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Maji_Ndogo/Weather_station_data.csv', # Insert the URL for the weather station data
    "weather_mapping_csv": 'https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Maji_Ndogo/Weather_data_field_mapping.csv', # Insert the URL for the weather data mapping CSV
    "regex_patterns" : {
        'Rainfall': r'(\d+(\.\d+)?)\s?mm',
        'Temperature': r'(\d+(\.\d+)?)\s?C',
        'Pollution_level': r'=\s*(-?\d+(\.\d+)?)|Pollution at \s*(-?\d+(\.\d+)?)'
    }
}
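To get a feel for what the `regex_patterns` do, here is a minimal sketch that applies them to a few made-up messages. The messages and the `extract_measurement` helper are assumptions for illustration only; the real `WeatherDataProcessor` may apply the patterns differently.

```python
import re

# Same patterns as in config_params above
regex_patterns = {
    'Rainfall': r'(\d+(\.\d+)?)\s?mm',
    'Temperature': r'(\d+(\.\d+)?)\s?C',
    'Pollution_level': r'=\s*(-?\d+(\.\d+)?)|Pollution at \s*(-?\d+(\.\d+)?)',
}

def extract_measurement(message):
    """Hypothetical helper: return (measurement, value) for the first matching pattern."""
    for measurement, pattern in regex_patterns.items():
        match = re.search(pattern, message)
        if match:
            # The pollution pattern has two alternatives, so we take the first group that matched.
            value = next(group for group in match.groups() if group is not None)
            return measurement, float(value)
    return None, None

# Made-up messages, not taken from the real weather station data
print(extract_measurement("Rainfall today was 12.5 mm"))   # ('Rainfall', 12.5)
print(extract_measurement("Temperature reading: 13.4 C"))  # ('Temperature', 13.4)
print(extract_measurement("Pollution at 0.45"))            # ('Pollution_level', 0.45)
```

Each pattern captures the number in a group, which is why the value can be converted to a float straight away.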

**Check if our file executes correctly**

Before we actually get to the analysis part, take a moment to notice how much simpler this data import is now. It feels like a lot of work, but now one cell of code imports and cleans all of our data.


In [3]:
# Uses the config_params dictionary defined in the Parameters cell above

field_processor = FieldDataProcessor(config_params)
field_processor.process()
field_df = field_processor.df

weather_processor = WeatherDataProcessor(config_params)
weather_processor.process()
weather_df = weather_processor.weather_df

# Rename 'Ave_temps' in field_df to 'Temperature' to match weather_df
field_df.rename(columns={'Ave_temps': 'Temperature'}, inplace=True)

weather_df['Measurement'].unique()

2024-03-01 18:55:59,503 - data_ingestion - INFO - Database engine created successfully.
2024-03-01 18:55:59,697 - data_ingestion - INFO - Query executed successfully.
2024-03-01 18:55:59,698 - scripts.field_data_processor.FieldDataProcessor - INFO - Successfully loaded data.
2024-03-01 18:55:59,700 - scripts.field_data_processor.FieldDataProcessor - INFO - Swapped columns: Annual_yield with Crop_type
2024-03-01 18:56:04,807 - data_ingestion - INFO - CSV file read successfully from the web.
2024-03-01 18:56:07,370 - data_ingestion - INFO - CSV file read successfully from the web.
2024-03-01 18:56:07,374 - scripts.weather_data_processor.WeatherDataProcessor - INFO - Successfully loaded weather station data from the web.
2024-03-01 18:56:07,486 - scripts.weather_data_processor.WeatherDataProcessor - INFO - Messages processed and measurements extracted.
2024-03-01 18:56:07,487 - scripts.weather_data_processor.WeatherDataProcessor - INFO - Data processing completed.


array(['Temperature', 'Pollution_level', 'Rainfall'], dtype=object)

**Expected output:**

```python
<Timestamp> - data_ingestion - INFO - Database engine created successfully.
<Timestamp> - data_ingestion - INFO - Query executed successfully.
<Timestamp> - scripts.field_data_processor.FieldDataProcessor - INFO - Successfully loaded data.
<Timestamp> - scripts.field_data_processor.FieldDataProcessor - INFO - Swapped columns: Annual_yield with Crop_type
<Timestamp> - data_ingestion - INFO - CSV file read successfully from the web.
<Timestamp> - data_ingestion - INFO - CSV file read successfully from the web.
<Timestamp> - scripts.weather_data_processor.WeatherDataProcessor - INFO - Successfully loaded weather station data from the web.
<Timestamp> - scripts.weather_data_processor.WeatherDataProcessor - INFO - Messages processed and measurements extracted.
<Timestamp> - scripts.weather_data_processor.WeatherDataProcessor - INFO - Data processing completed.

array(['Temperature', 'Pollution_level', 'Rainfall'], dtype=object)
```

### Validating our data pipeline


There should be a `validate_data.py` file in the `test` folder of the notebook directory. This is a `pytest` script that runs a few tests to check whether the data we have is the data we expect. Have a look at the test script, and try to understand what we're testing.

`pytest` normally runs from the command line because it is set up to be automated. To test the data, we have to give `pytest` access to that data. The simplest way to do this is by creating CSV files, importing them into `validate_data.py`, and running the tests.
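If you have not seen a `pytest` data test before, here is a rough sketch of the idea. It is not the contents of `validate_data.py`: the toy DataFrame, the `Elevation` column name, and the list of valid crops are assumptions, and the real script would load the CSV files with `pd.read_csv()` instead.

```python
import pandas as pd

def load_toy_field_df():
    """Hypothetical stand-in for pd.read_csv('sampled_field_df.csv'); values are made up."""
    return pd.DataFrame({
        "Elevation": [786.1, 674.3, 826.5],
        "Crop_type": ["cassava", "tea", "wheat"],
    })

# pytest collects any function whose name starts with 'test_' and
# reports PASSED if none of its assert statements fail.
def test_field_dataframe_not_empty():
    df = load_toy_field_df()
    assert not df.empty

def test_field_dataframe_non_negative_elevation():
    df = load_toy_field_df()
    assert (df["Elevation"] >= 0).all()

def test_crop_types_are_valid():
    valid_crop_types = {"cassava", "tea", "wheat"}  # assumed subset for this sketch
    df = load_toy_field_df()
    assert df["Crop_type"].isin(valid_crop_types).all()
```

Each test checks one expectation about the data, so a failure points directly at the cleaning step that did not work.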

The following code creates the CSV files, runs `pytest` from the notebook using `!pytest test/validate_data.py -v`, and deletes the CSV files once the tests are complete.

In [4]:
# !pip install pytest
weather_df.to_csv('sampled_weather_df.csv', index=False)
field_df.to_csv('sampled_field_df.csv', index=False)

!pytest test/validate_data.py -v

import os

# Define the file paths
weather_csv_path = 'sampled_weather_df.csv'
field_csv_path = 'sampled_field_df.csv'

# Delete sampled_weather_df.csv if it exists
if os.path.exists(weather_csv_path):
    os.remove(weather_csv_path)
    print(f"Deleted {weather_csv_path}")
else:
    print(f"{weather_csv_path} does not exist.")

# Delete sampled_field_df.csv if it exists
if os.path.exists(field_csv_path):
    os.remove(field_csv_path)
    print(f"Deleted {field_csv_path}")
else:
    print(f"{field_csv_path} does not exist.")

platform win32 -- Python 3.8.18, pytest-8.0.2, pluggy-1.4.0 -- C:\ProgramData\anaconda3\envs\sql\python.exe
cachedir: .pytest_cache
rootdir: C:\Users\Pauline PC\Documents\ALXdata\python\integrated_project
plugins: anyio-4.2.0
collecting ... collected 9 items

test/validate_data.py::test_read_weather_DataFrame_shape PASSED          [ 11%]
test/validate_data.py::test_read_field_DataFrame_shape PASSED            [ 22%]
test/validate_data.py::test_weather_dataframe_columns PASSED             [ 33%]
test/validate_data.py::test_field_dataframe_columns PASSED               [ 44%]
test/validate_data.py::test_crop_types_are_valid PASSED                  [ 55%]
test/validate_data.py::test_field_dataframe_non_negative_elevation PASSED [ 66%]
test/validate_data.py::test_positive_rainfall_values PASSED              [ 77%]
test/validate_data.py::test_weather_dataframe

**Expected output:**

```python
============================ test session starts =============================
platform win32 -- Python 3.12.1, pytest-8.0.0, pluggy-1.4.0 -- ...
cachedir: .pytest_cache
rootdir: ...
plugins: anyio-4.2.0
collecting ... collected 9 items

test/validate_data.py::test_read_weather_DataFrame_shape PASSED          [ 11%]
test/validate_data.py::test_read_field_DataFrame_shape PASSED            [ 22%]
test/validate_data.py::test_weather_dataframe_columns PASSED             [ 33%]
test/validate_data.py::test_field_dataframe_columns PASSED               [ 44%]
test/validate_data.py::test_crop_types_are_valid PASSED                  [ 55%]
test/validate_data.py::test_field_dataframe_non_negative_elevation PASSED [ 66%]
test/validate_data.py::test_positive_rainfall_values PASSED              [ 77%]
test/validate_data.py::test_weather_dataframe_not_empty PASSED           [ 88%]
test/validate_data.py::test_field_dataframe_not_empty PASSED             [100%]

============================== warnings summary ===============================
..\..\..\..\..\..\..\..\anaconda3\envs\Latest\Lib\site-packages\dateutil\tz\tz.py:37
  ...: DeprecationWarning: datetime.datetime.utcfromtimestamp() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.fromtimestamp(timestamp, datetime.UTC).
    EPOCH = datetime.datetime.utcfromtimestamp(0)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html

============================= 9 passed in 38.85s ==============================

Deleted sampled_weather_df.csv
Deleted sampled_field_df.csv
```

>⚠️ Depending on the version of Python, there may be various warnings like the one above. These are normally `DeprecationWarnings` so we can safely ignore these for now. We're interested in whether all the dataset tests passed. 

Great! Now we know our data resembles what we expect! As our project evolves we may have to add more module functionality or create more rigorous tests of the data. 


Ok, now we can circle back to the start. As I mentioned, setting a tolerance might have been a simple way to measure if our field data and weather data agree, but we didn't take into account if either dataset was spread out.
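To see why the spread matters, here is a small sketch using made-up numbers (not values from our datasets). In both comparisons the means differ by exactly 0.5, so a fixed tolerance would treat them the same, but a t-test does not.

```python
from scipy.stats import ttest_ind

# Made-up temperatures: tightly clustered samples whose means differ by 0.5
field_tight = [13.0, 13.1, 12.9, 13.0, 13.1, 12.9]
weather_tight = [13.5, 13.6, 13.4, 13.5, 13.6, 13.4]

# Made-up temperatures: widely spread samples whose means also differ by 0.5
field_wide = [10.0, 16.0, 11.0, 15.0, 12.0, 14.0]
weather_wide = [10.5, 16.5, 11.5, 15.5, 12.5, 14.5]

_, p_tight = ttest_ind(field_tight, weather_tight, equal_var=False)
_, p_wide = ttest_ind(field_wide, weather_wide, equal_var=False)

print(f"Tightly clustered data: p = {p_tight:.5f}")
print(f"Widely spread data:     p = {p_wide:.5f}")
```

With tightly clustered data, a 0.5 difference is very unlikely to be random, so the p-value is small. With widely spread data, the same difference is easily explained by chance, so the p-value is large.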

I hope you already have some idea of the problem, but we need to tell this story anyway. Hopefully, by the time we're done, I'll have convinced you that I made an error last time.

Back to our initial plan:

1. Create a null hypothesis.
1. Import the `MD_agric_df` dataset and clean it up.
1. Import the weather data.
1. Map the weather data to the field data.
1. Calculate the means of the weather station dataset and the means of the main dataset.
2. Calculate all the parameters we need to do a t-test. 
3. Interpret our results.

## Hypothesis

So what are we testing with our null hypothesis $H_0$? Well, we want to know if our field data is representing the reality in Maji Ndogo by looking at an independent set of data. If our field data (means) are the same as the weather data (means), then it indicates no significant difference between the datasets. We're essentially saying that any difference we see between these means is because of randomness. However, if the means differ significantly, we'll know there is a reason for it, and that it is not just a random fluctuation in the data. 

<br>

Given a significance level $\alpha$ of 0.05 for a two-tailed test, we have the following hypotheses, tested at the 95% confidence level:

- $H_0$: There is no significant difference between the means of the two datasets. This is expressed as $\mu_{field} = \mu_{weather}$.

- $H_a$: There is a significant difference between the means of the two datasets. This is expressed as $\mu_{field} \neq \mu_{weather}$.

<br>

If the p-value obtained from the test:
- is less than or equal to the significance level, so $p \leq \alpha$, we reject the null hypothesis.
- is larger than the significance level, so $p > \alpha$, we cannot reject the null hypothesis, as we cannot find a statistically significant difference between the datasets at the 95% confidence level.

Now, let's code it out. 

First, we're going to import the packages we need and define a few variables. You might notice we're importing a new function, `ttest_ind()`. This function takes in two data columns, calculates their means and variances, and returns the t-statistic and the p-value. So our t-test is reduced to one line. Since our alternative hypothesis does not make a claim of greater or less than, we will use the two-sided t-test. This is the default behaviour of `ttest_ind()`, and it can be made explicit with the `alternative='two-sided'` keyword.
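As a quick illustration, the sketch below runs `ttest_ind()` on two short, made-up lists and checks that leaving out `alternative` gives the same result as passing `alternative='two-sided'`. The sample values are assumptions for illustration, and the `alternative` keyword requires a reasonably recent version of SciPy (1.6 or later).

```python
from scipy.stats import ttest_ind

# Made-up samples, not taken from field_df or weather_df
sample_a = [13.0, 13.4, 12.8, 13.6, 13.1]
sample_b = [13.2, 13.9, 12.7, 13.5, 13.3]

# Default call: the test is two-sided unless we say otherwise
default_result = ttest_ind(sample_a, sample_b, equal_var=False)

# Explicit call: same test, with the alternative spelled out
two_sided_result = ttest_ind(sample_a, sample_b, equal_var=False, alternative='two-sided')

# The result can be unpacked into the t-statistic and the p-value
t_stat, p_val = default_result
print(f"T-stat: {t_stat:.5f}, p-value: {p_val:.5f}")
print(f"Same p-value when explicit: {default_result.pvalue == two_sided_result.pvalue}")
```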

In [5]:
from scipy.stats import ttest_ind
import numpy as np

# Now, the measurements_to_compare can directly use 'Temperature', 'Rainfall', and 'Pollution_level'
measurements_to_compare = ['Temperature', 'Rainfall', 'Pollution_level']

Let's pause for a second and clarify what exactly we're comparing. 

We want to compare the means of the temperature, rainfall, and pollution data for fields assigned to a specific weather station. So for both datasets, we need to isolate the measurement type and the weather station, so that we're comparing the correct means.
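The two DataFrames are shaped differently, which is why each one needs its own filter. The sketch below uses toy stand-ins with made-up values: the field data has one column per measurement, while the weather data has one row per measurement with the type stored in `Measurement` and the number in `Value`.

```python
import pandas as pd

# Toy stand-in for field_df: one row per field, one column per measurement
toy_field_df = pd.DataFrame({
    "Weather_station": [0, 0, 1, 1],
    "Temperature": [13.0, 14.0, 15.0, 17.0],
})

# Toy stand-in for weather_df: one row per reading
toy_weather_df = pd.DataFrame({
    "Weather_station_ID": [0, 0, 1, 1, 0],
    "Measurement": ["Temperature", "Temperature", "Temperature", "Temperature", "Rainfall"],
    "Value": [13.2, 13.8, 15.5, 16.5, 900.0],
})

# Mean field temperature per weather station
field_means = toy_field_df.groupby("Weather_station")["Temperature"].mean()

# Mean weather station temperature: keep only the temperature rows first
weather_means = (
    toy_weather_df[toy_weather_df["Measurement"] == "Temperature"]
    .groupby("Weather_station_ID")["Value"]
    .mean()
)

print(field_means)
print(weather_means)
```

These are the pairs of means that the t-test compares, one station and one measurement at a time.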

Let's break down what we need to do:
1. We need to filter both `field_df` and `weather_df` based on the given station ID and measurement. We can use `filter_field_data(df, station_id, measurement)` and `filter_weather_data(df, station_id, measurement)`.  
2. We need to perform a t-test on the filtered data. So we're going to use `ttest_ind(data_col1, data_col2, equal_var=False)` from `scipy.stats`. Setting `equal_var=False` means the test does not assume that the two datasets have the same variance.
3. We need to interpret and print the results of the t-test. We can use `print_ttest_results(station_id, measurement, p_val, alpha)`.
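For reference, calling `ttest_ind` with `equal_var=False` performs Welch's t-test. Writing $\bar{x}$ for a sample mean, $s^2$ for a sample variance, and $n$ for a sample size, the t-statistic is:

$$
t = \frac{\bar{x}_{field} - \bar{x}_{weather}}{\sqrt{\dfrac{s_{field}^2}{n_{field}} + \dfrac{s_{weather}^2}{n_{weather}}}}
$$

The p-value is then read from a t-distribution whose degrees of freedom $\nu$ are approximated by:

$$
\nu \approx \frac{\left(\dfrac{s_{field}^2}{n_{field}} + \dfrac{s_{weather}^2}{n_{weather}}\right)^2}{\dfrac{\left(s_{field}^2 / n_{field}\right)^2}{n_{field} - 1} + \dfrac{\left(s_{weather}^2 / n_{weather}\right)^2}{n_{weather} - 1}}
$$

The denominator of $t$ is where the spread of each dataset enters the test: the larger the variances, the smaller $|t|$ becomes for the same difference in means.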

We'll first define these functions, focusing on `Temperature` for `station ID = 0`. Then, we'll integrate these functions into a loop that iterates over each station ID and measurement type.

<br> 

⚙️ **Task:** Create a `filter_field_data` function that takes in the `field_df` DataFrame, the `station_id`, and `measurement` type, and returns a **single column** (series) of data filtered by the `station_id`, and `measurement`.

In [6]:
### START FUNCTION
def filter_field_data(df, station_id, measurement):
    """
    Filters field data based on station_id and measurement.

    Args:
    - df (pandas.DataFrame): The DataFrame containing field data.
    - station_id (int): The ID of the weather station.
    - measurement (str): The type of measurement to filter.

    Returns:
    - pandas.Series: A single column (series) of data filtered by station_id and measurement.
    """
    # Keep only the rows for the given weather station, then select the measurement column.
    df = df[df['Weather_station'] == station_id][measurement]
    return df
### END FUNCTION

<br>

**Input 1:**

In [7]:
# Example for station ID 0 and Temperature
station_id = 0
alpha = 0.05
measurement = 'Temperature'

# Filter data for the specific station and measurement
field_values = filter_field_data(field_df, station_id, measurement)
field_values

1       13.35
2       13.30
8       12.80
10      13.70
14      13.35
        ...  
5627    13.30
5630    14.25
5632    11.00
5638    13.30
5642    12.85
Name: Temperature, Length: 1375, dtype: float64

**Expected outcome:**

```python
1       13.35
2       13.30
8       12.80
10      13.70
14      13.35
        ...  
5627    13.30
5630    14.25
5632    11.00
5638    13.30
5642    12.85
Name: Temperature, Length: 1375, dtype: float64
```

<br>

**Input 2:**

In [8]:
# Example for station ID 0 and Temperature
station_id = 0
alpha = 0.05
measurement = 'Temperature'

# Filter data for the specific station and measurement
field_values = filter_field_data(field_df, station_id, measurement)
print(f"Shape: {field_values.shape}, First value: {field_values.iloc[0]} ")

Shape: (1375,), First value: 13.35 


**Expected outcome:**

`Shape: (1375,), First value: 13.35 `

<br> 

⚙️ **Task:** Create a data filter function that takes in the `weather_df` DataFrame, the `station_id`, and `measurement` type, and returns a **single column** (series) of data filtered by the `station_id`, and `measurement`.

In [9]:
### START FUNCTION

def filter_weather_data(df, station_id, measurement):
    """
    Filters a weather DataFrame based on the given station_id and measurement.

    Parameters:
    - df (pd.DataFrame): The input weather DataFrame.
    - station_id (int): The station ID to filter by.
    - measurement (str): The measurement type to filter by.

    Returns:
    - pd.Series: A single column (series) of data filtered by the station_id and measurement.
    """
    # Keep only the rows for this station and measurement type, then select the 'Value' column.
    df = df[(df['Weather_station_ID'] == station_id) & (df['Measurement'] == measurement)]['Value']
    return df

### END FUNCTION

<br> 

**Input 1:**

In [104]:
# Example for station ID 0 and Temperature
station_id = 0
alpha = 0.05
measurement = 'Temperature'

# Filter data for the specific station and measurement

weather_values = filter_weather_data(weather_df, station_id, measurement)
weather_values


0       12.82
2       14.53
29      14.28
32      12.87
67      13.13
        ...  
1804    12.77
1805    14.13
1817    13.14
1833    14.14
1834    13.61
Name: Value, Length: 100, dtype: float64

**Expected outcome:**

```python
0       12.82
2       14.53
29      14.28
32      12.87
67      13.13
        ...  
1804    12.77
1805    14.13
1817    13.14
1833    14.14
1834    13.61
Name: Value, Length: 100, dtype: float64
```

<br> 

**Input 2:**

In [105]:
# Example for station ID 0 and Temperature
station_id = 0
alpha = 0.05
measurement = 'Temperature'

# Filter data for the specific station and measurement

weather_values = filter_weather_data(weather_df, station_id, measurement)

print(f"Shape: {weather_values.shape}, First value: {weather_values.iloc[0]}")

Shape: (100,), First value: 12.82


**Expected outcome:**

`Shape: (100,), First value: 12.82 `

⚙️ **Task:** Create a function that calculates the t-statistic and p-value. The function should accept two **single columns** of data and return a tuple of the t-statistic and p-value.

In [106]:
### START FUNCTION
def run_ttest(Column_A, Column_B):
    """
    Calculates the t-statistic and p-value for two sets of data.

    Parameters:
    - Column_A (pd.Series or iterable): First set of data.
    - Column_B (pd.Series or iterable): Second set of data.

    Returns:
    - tuple: A tuple containing the t-statistic and p-value.
    """
    t_statistic, p_value = ttest_ind(Column_A, Column_B, equal_var=False)
    return t_statistic, p_value
### END FUNCTION

<br> 

**Input:**

In [107]:
# Example for station ID 0 and Temperature
station_id = 0
alpha = 0.05
measurement = 'Temperature'

# Filter data for the specific station and measurement
field_values = filter_field_data(field_df, station_id, measurement)
weather_values = filter_weather_data(weather_df, station_id, measurement)

# Perform t-test
t_stat, p_val = run_ttest(field_values, weather_values)
print(f"T-stat: {t_stat:.5f}, p-value: {p_val:.5f}")

T-stat: -0.11632, p-value: 0.90761


**Expected outcome:**

`T-stat: -0.11632, p-value: 0.90761`

<br>

⚙️ **Task:** Replace the **\<MISSING CODE>** to print out the t-test result.

In [108]:
### START FUNCTION

def print_ttest_results(station_id, measurement, p_val, alpha):
    """
    Interprets and prints the results of a t-test based on the p-value.
    """
    if p_val <= alpha:
        print(f"   Significant difference in {measurement} detected at Station  {station_id}, (P-Value: {p_val:.5f} < {alpha}). Null hypothesis rejected.")
    else:
        print(f"   No significant difference in {measurement} detected at Station  {station_id}, (P-Value: {p_val:.5f} > {alpha}). Null hypothesis not rejected.")

### END FUNCTION

**Input:**

In [109]:
# Example for station ID 0 and Temperature
station_id = 0
alpha = 0.05
measurement = 'Temperature'

# Filter data for the specific station and measurement
field_values = filter_field_data(field_df, station_id, measurement)
weather_values = filter_weather_data(weather_df, station_id, measurement)

# Perform t-test
t_stat, p_val = run_ttest(field_values, weather_values)
print_ttest_results(station_id, measurement, p_val, alpha)

   No significant difference in Temperature detected at Station  0, (P-Value: 0.90761 > 0.05). Null hypothesis not rejected.


**Expected outcome:**

`No significant difference in Temperature detected at Station 0, (P-Value: 0.90761 > 0.05). Null hypothesis not rejected.`

Now we can put it all together in a loop.

<br>

⚙️ **Task:** Create a function that loops over `measurements_to_compare` and every `station_id`, performs a t-test, and prints the results. The function should accept `field_df`, `weather_df`, `list_measurements_to_compare`, and `alpha`. The value of `alpha` should default to 0.05. Hint: use `print_ttest_results()`.

In [110]:
### START FUNCTION
def hypothesis_results(field_df, weather_df, list_measurements_to_compare, alpha = 0.05):
    """
    Performs t-tests on specified measurements for each weather station and prints the results.

    Parameters:
    - field_df (pd.DataFrame): DataFrame containing field data.
    - weather_df (pd.DataFrame): DataFrame containing weather data.
    - list_measurements_to_compare (list): List of measurement types to compare.
    - alpha (float, optional): Significance level. Default is 0.05.

    Returns:
    - None
    """
    for station_id in sorted(weather_df['Weather_station_ID'].unique()):
        for measurement in list_measurements_to_compare:
            field_values = filter_field_data(field_df, station_id, measurement)
            weather_values = filter_weather_data(weather_df, station_id, measurement)
            t_statistic, p_value = run_ttest(field_values, weather_values)
            print_ttest_results(station_id, measurement, p_value, alpha)
### END FUNCTION

**Input:**

In [111]:
alpha = 0.05
hypothesis_results(field_df, weather_df, measurements_to_compare, alpha)

   No significant difference in Temperature detected at Station  0, (P-Value: 0.90761 > 0.05). Null hypothesis not rejected.
   No significant difference in Rainfall detected at Station  0, (P-Value: 0.21621 > 0.05). Null hypothesis not rejected.
   No significant difference in Pollution_level detected at Station  0, (P-Value: 0.56418 > 0.05). Null hypothesis not rejected.
   No significant difference in Temperature detected at Station  1, (P-Value: 0.47241 > 0.05). Null hypothesis not rejected.
   No significant difference in Rainfall detected at Station  1, (P-Value: 0.54499 > 0.05). Null hypothesis not rejected.
   No significant difference in Pollution_level detected at Station  1, (P-Value: 0.24410 > 0.05). Null hypothesis not rejected.
   No significant difference in Temperature detected at Station  2, (P-Value: 0.88671 > 0.05). Null hypothesis not rejected.
   No significant difference in Rainfall detected at Station  2, (P-Value: 0.36466 > 0.05). Null hypothesis not rejected.
 

**Expected outcome:**
```python 
   No significant difference in Temperature detected at Station 0, (P-Value: 0.90761 > 0.05). Null hypothesis not rejected.
   No significant difference in Rainfall detected at Station 0, (P-Value: 0.21621 > 0.05). Null hypothesis not rejected.
   No significant difference in Pollution_level detected at Station 0, (P-Value: 0.56418 > 0.05). Null hypothesis not rejected.
   No significant difference in Temperature detected at Station 1, (P-Value: 0.47241 > 0.05). Null hypothesis not rejected.
   No significant difference in Rainfall detected at Station 1, (P-Value: 0.54499 > 0.05). Null hypothesis not rejected.
   No significant difference in Pollution_level detected at Station 1, (P-Value: 0.24410 > 0.05). Null hypothesis not rejected.
   No significant difference in Temperature detected at Station 2, (P-Value: 0.88671 > 0.05). Null hypothesis not rejected.
   No significant difference in Rainfall detected at Station 2, (P-Value: 0.36466 > 0.05). Null hypothesis not rejected.
   No significant difference in Pollution_level detected at Station 2, (P-Value: 0.99388 > 0.05). Null hypothesis not rejected.
   No significant difference in Temperature detected at Station 3, (P-Value: 0.66445 > 0.05). Null hypothesis not rejected.
   No significant difference in Rainfall detected at Station 3, (P-Value: 0.39847 > 0.05). Null hypothesis not rejected.
   No significant difference in Pollution_level detected at Station 3, (P-Value: 0.15466 > 0.05). Null hypothesis not rejected.
   No significant difference in Temperature detected at Station 4, (P-Value: 0.88575 > 0.05). Null hypothesis not rejected.
   No significant difference in Rainfall detected at Station 4, (P-Value: 0.33237 > 0.05). Null hypothesis not rejected.
   No significant difference in Pollution_level detected at Station 4, (P-Value: 0.21508 > 0.05). Null hypothesis not rejected.
   ```

Great! There we go. For all of our measurements, the p-value > alpha, so there is not enough evidence to reject the null hypothesis. This means we have no evidence to suggest that the weather data is different from the field data. This makes us confident that our field data, at least in terms of temperature, rainfall, and pollution level, reflects reality. 
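Reading fifteen printed lines works, but a table can be easier to check at a glance. The sketch below is one possible variation, not part of the challenge: `collect_ttest_results` is a hypothetical helper that returns a DataFrame instead of printing, and it is run here on made-up toy data that uses the same column names as our DataFrames.

```python
import pandas as pd
from scipy.stats import ttest_ind

def collect_ttest_results(field_df, weather_df, measurements, alpha=0.05):
    """Hypothetical helper: one row per (station, measurement) with its p-value."""
    rows = []
    for station_id in sorted(weather_df["Weather_station_ID"].unique()):
        for measurement in measurements:
            field_values = field_df.loc[field_df["Weather_station"] == station_id, measurement]
            weather_values = weather_df.loc[
                (weather_df["Weather_station_ID"] == station_id)
                & (weather_df["Measurement"] == measurement),
                "Value",
            ]
            _, p_value = ttest_ind(field_values, weather_values, equal_var=False)
            rows.append({
                "station_id": station_id,
                "measurement": measurement,
                "p_value": p_value,
                "reject_h0": p_value <= alpha,
            })
    return pd.DataFrame(rows)

# Made-up toy data for two stations
toy_field_df = pd.DataFrame({
    "Weather_station": [0, 0, 0, 0, 1, 1, 1, 1],
    "Temperature": [13.0, 13.4, 12.8, 13.6, 15.0, 15.5, 14.8, 15.3],
})
toy_weather_df = pd.DataFrame({
    "Weather_station_ID": [0, 0, 0, 0, 1, 1, 1, 1],
    "Measurement": ["Temperature"] * 8,
    "Value": [13.1, 13.3, 12.9, 13.5, 15.1, 15.4, 14.9, 15.2],
})

summary = collect_ttest_results(toy_field_df, toy_weather_df, ["Temperature"])
print(summary)
print("Any null hypothesis rejected?", summary["reject_h0"].any())
```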

Why was this important? Well, we saw from the EDA that there were some relationships, and possible correlations with the standard yield, but we really can't say what affects a crop's success, because all of the features seemed to. In a sense, we as humans could not clearly see the relationships. If we were given a set of conditions like rainfall, pH, and crop type, we could not reliably estimate the standard yield of a crop, because the relationships are hard to understand.

So our next step is to allow a machine to look for patterns, which is Machine Learning (ML). Computers are not limited to three dimensions, can calculate for hours, and can find hidden patterns we cannot. Machine learning follows a basic principle that holds across computational domains: junk in, junk out. We needed to make sure that the data we're feeding into ML models is accurate. Now we know it is, and we're ready for the next step. 

You must have been itching to get into AI, so we'll dive in soon.

Until then, look after yourself!
Saana

#  

<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px"/>
</div>