# Data Science Assignment

## 0. Introduction
In this assignment, you will work on a crop recommendation dataset to practice and test your skills in data science.
Follow the steps outlined below to complete the assignment.

#### Context
Precision agriculture is a growing trend. It helps farmers make informed decisions about their farming strategy. This dataset allows users to build a predictive model that recommends the most suitable crops to grow on a particular farm based on various parameters.

This dataset was built by augmenting datasets of rainfall, climate, and fertilizer data available for India.

#### Data fields
- N - ratio of Nitrogen content in soil
- P - ratio of Phosphorous content in soil
- K - ratio of Potassium content in soil
- temperature - temperature in degree Celsius
- humidity - relative humidity in %
- ph - pH value of the soil
- rainfall - rainfall in mm

## 1. Data Processing
Load the dataset and check for missing values or any inconsistencies. Perform necessary data cleaning tasks if required.


### Task 1:
- Read the dataset using the pandas library

```python
# Your code here to load and process the data
import pandas as pd

# Load dataset
crop_data = pd.read_csv('<dataset-here>')
```

### Task 2: 

Find how many rows there are in the dataset.
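As a minimal sketch, the row count can be read from the length or the shape of the DataFrame. The small frame below is made-up data standing in for the real dataset, which is not included here.

```python
import pandas as pd

# Made-up rows standing in for the real dataset (assumption for illustration)
crop_data = pd.DataFrame({
    "N": [90, 85, 60],
    "rainfall": [202.9, 226.7, 263.9],
})

n_rows = len(crop_data)                # number of rows
rows, columns = crop_data.shape        # (rows, columns) tuple
print(f"rows: {n_rows}, columns: {columns}")
```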

### Task 3: 

- Find the rows that contain missing values, if any.
- Print all the columns that contain null/missing values.
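One possible way to approach this is with `isnull()` combined with `any()` and `sum()`. The sketch below uses a made-up frame with missing values deliberately planted in it; the real dataset may have none.

```python
import numpy as np
import pandas as pd

# Made-up data with planted gaps (assumption for illustration)
crop_data = pd.DataFrame({
    "N": [90.0, np.nan, 60.0, 74.0],
    "ph": [6.5, 7.0, np.nan, 6.9],
    "rainfall": [202.9, 226.7, 263.9, 242.8],
})

# Rows with at least one missing value
rows_with_missing = crop_data[crop_data.isnull().any(axis=1)]

# Missing-value count per column, then the names of the affected columns
missing_per_column = crop_data.isnull().sum()
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

print(rows_with_missing)
print(columns_with_missing)
```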

### Task 4: 

Find the min, max, mean, median, and standard deviation of every numeric column in this dataset.
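A short sketch of one way to get all five statistics at once, using `DataFrame.agg`. The data is made up, and the `label` column name is an assumption for the crop-type column, which is excluded here because it is not numeric.

```python
import pandas as pd

# Made-up data; "label" is a hypothetical name for the crop-type column
crop_data = pd.DataFrame({
    "temperature": [20.0, 22.0, 24.0, 26.0],
    "rainfall": [100.0, 150.0, 200.0, 250.0],
    "label": ["rice", "rice", "maize", "maize"],
})

# Keep numeric columns only, then compute the requested statistics
numeric = crop_data.select_dtypes(include="number")
summary = numeric.agg(["min", "max", "mean", "median", "std"])
print(summary)
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default.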

## 2. Data Pre-Processing

**What is the purpose of data pre-processing?**

Data pre-processing is the process of preparing and cleaning raw data to make it suitable for analysis. It involves tasks such as data cleaning, transformation, and feature engineering.

### Task 5: 

Fill in any missing values as follows:

1. If the column is numeric, replace missing values with the mean of the column.
2. If the column is categorical, replace missing values with the most frequently occurring value (the mode) of the column.
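The two rules above can be sketched as a loop over the columns. This is only one possible approach, shown on made-up data; the `label` column name is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up data with planted gaps; "label" is a hypothetical column name
crop_data = pd.DataFrame({
    "N": [90.0, np.nan, 60.0],
    "label": ["rice", "rice", None],
})

for col in crop_data.columns:
    if pd.api.types.is_numeric_dtype(crop_data[col]):
        # Numeric column: fill with the column mean
        crop_data[col] = crop_data[col].fillna(crop_data[col].mean())
    else:
        # Categorical column: fill with the most frequent value (mode)
        crop_data[col] = crop_data[col].fillna(crop_data[col].mode().iloc[0])

print(crop_data)
```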

## 3. Feature Engineering
Analyze the features available in the dataset and perform feature engineering if needed. This may include creating new features or transforming existing ones.





### Task 6: 

1. Apply binning (discretization) to the `N` feature (ratio of Nitrogen content in soil).
2. Find which bin (low, medium, high) has the most observations.
3. Find which bin (low, medium, high) has the fewest observations.

**Description:** Divide the continuous values of Nitrogen content into discrete bins (e.g., low, medium, high). This can be helpful if specific ranges of nitrogen levels correlate better with certain crop types.
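A minimal sketch of the binning step, assuming three equal-width bins created with `pd.cut`. The `N` values are made up, and equal-width binning is only one option; `pd.qcut` would give equal-frequency bins instead. The `N_Level` column name follows the one used later in this assignment.

```python
import pandas as pd

# Made-up nitrogen values (assumption for illustration)
crop_data = pd.DataFrame({"N": [5, 12, 30, 45, 48, 55, 60, 90, 120, 140]})

# Three equal-width bins over the observed range of N
crop_data["N_Level"] = pd.cut(
    crop_data["N"], bins=3, labels=["low", "medium", "high"]
)

# Count observations per bin and pick the largest and smallest
counts = crop_data["N_Level"].value_counts()
most_common_bin = counts.idxmax()
least_common_bin = counts.idxmin()

print(counts)
print(most_common_bin, least_common_bin)
```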

### Task 7: 


1. Apply normalization/scaling to the `P` feature (ratio of Phosphorous content in soil, kg/ha).
2. Explain the gap between 0.6 and 0.8 in the histogram of the scaled P values.

**Description:** Scale the Phosphorous content values to a range, such as [0, 1], or standardize them to have a mean of 0 and a standard deviation of 1. This is particularly useful for algorithms that are sensitive to feature magnitude, like neural networks or gradient-based methods.

The gap between 0.6 and 0.8 in the histogram of the scaled Phosphorous content (P) means that there are no data points in this range after scaling. Min-max scaling is a linear transformation, so it preserves the shape of the distribution: a gap in the scaled values reflects a corresponding gap in the original values. This could happen for a few reasons:

- **Original data distribution:** The original values of the Phosphorous content (P) in the dataset might not include any values that, when scaled, fall between 0.6 and 0.8. It’s possible that the original data had clusters or gaps that left this range empty once the scaling was applied.
- **Discrete or limited data values:** If the original data for Phosphorous content has a limited set of unique values or is clustered around certain ranges, the scaled values will naturally skip certain intervals, such as the one between 0.6 and 0.8.
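To illustrate both scaling options and the gap discussed above, here is a small sketch on made-up `P` values that form two clusters. The numbers are chosen for illustration only and are not taken from the dataset.

```python
import pandas as pd

# Made-up P values: one cluster up to 85, another from 120, nothing in between
p = pd.Series([5, 20, 40, 60, 75, 85, 120, 130, 145], name="P")

# Min-max scaling to the range [0, 1]
p_minmax = (p - p.min()) / (p.max() - p.min())

# Standardization to mean 0 and standard deviation 1 (population std, ddof=0)
p_standard = (p - p.mean()) / p.std(ddof=0)

# The empty region of the raw data is still empty after scaling
in_gap = p_minmax[(p_minmax >= 0.6) & (p_minmax <= 0.8)]
print(p_minmax.round(3).tolist())
print(len(in_gap))
```

Because the raw values skip the interval between 85 and 120, the scaled values skip the matching interval. scikit-learn's `MinMaxScaler` and `StandardScaler` apply the same two formulas.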

## 4. Exploratory Data Analysis (EDA)
Perform EDA to gain insights into the dataset. Visualize relationships between different features and the target variable.

### Task 8:

Analyze the relationship between crop type and rainfall to determine which crops grow well under higher rainfall conditions. Visualize the data using an appropriate plot to gain insights into the rainfall preferences of different crops.

**Hint:** Use a boxplot from the seaborn library to visualize the distribution of rainfall for each crop type. The x-axis should represent the crop types, and the y-axis should show the amount of rainfall. This will help you identify which crops have higher median and upper quartile values, indicating their preference for more rainfall.
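A boxplot draws the quartiles of each group, so the same comparison can be sketched numerically with a grouped summary. The data below is made up, and `label` is an assumed name for the crop-type column.

```python
import pandas as pd

# Made-up rainfall readings; "label" is a hypothetical crop-type column
crop_data = pd.DataFrame({
    "label": ["rice"] * 4 + ["maize"] * 4 + ["chickpea"] * 4,
    "rainfall": [200, 220, 240, 260, 60, 80, 90, 110, 70, 75, 85, 90],
})

# Quartiles per crop, sorted so the highest-median crops come first
rainfall_summary = (
    crop_data.groupby("label")["rainfall"]
    .describe()[["25%", "50%", "75%"]]
    .sort_values("50%", ascending=False)
)
print(rainfall_summary)
```

With seaborn installed, the matching plot would be `sns.boxplot(data=crop_data, x="label", y="rainfall")`.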

### Task 9:

Analyze the relationship between temperature and crop type to determine if certain crops prefer specific temperature ranges. Visualize the temperature distribution for each crop using a violin plot or swarm plot to identify any patterns or trends in the data.

**Hint:** Use a violin plot or swarm plot from the seaborn library. For a violin plot, set the x-axis to represent the crop types and the y-axis to show temperature values, which will help visualize the spread and density of temperatures for each crop. Alternatively, use a swarm plot to display individual temperature data points for each crop type, which provides insights into clustering and spread.

**Interpretation:**

**Patterns and Trends:**

- Crops whose distributions are dense around a specific temperature range likely prefer those conditions.
- Crops with narrow distributions likely thrive only within a specific, limited temperature range.
- If a crop's distribution extends across a broad range, the crop may be adaptable to varying temperature conditions.
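The narrow-versus-broad distinction can also be checked numerically. The following sketch computes the temperature range and standard deviation per crop on made-up data; `label` is an assumed name for the crop-type column.

```python
import pandas as pd

# Made-up temperatures; "label" is a hypothetical crop-type column
crop_data = pd.DataFrame({
    "label": ["rice"] * 4 + ["grapes"] * 4,
    "temperature": [22.0, 23.0, 24.0, 25.0, 10.0, 20.0, 30.0, 40.0],
})

# Spread of temperature per crop: a small range suggests a narrow preference
temp_spread = crop_data.groupby("label")["temperature"].agg(["min", "max", "std"])
temp_spread["range"] = temp_spread["max"] - temp_spread["min"]
temp_spread = temp_spread.sort_values("range")
print(temp_spread)
```

With seaborn installed, the matching plot would be `sns.violinplot(data=crop_data, x="label", y="temperature")`.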

### Task 10:

Perform a correlation analysis between the features (N, P, K, temperature, humidity, ph, and rainfall). Identify which features are strongly correlated with each other. Visualize these correlations using a heatmap and discuss how these relationships might impact crop growth or recommendations.

The majority of feature pairs show low or no correlation (correlation coefficients close to 0). This suggests that features such as N, P, K, temperature, humidity, ph, and rainfall mostly lack direct linear relationships with one another, so each contributes largely separate information about crop growth.
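As a sketch of how strongly correlated pairs could be picked out of the correlation matrix, the example below uses made-up values in which `K` is deliberately constructed as a linear function of `P`. The 0.7 threshold is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values: K = 2 * P + 5, so P and K are perfectly correlated;
# ph alternates and is only weakly related to the other two
df = pd.DataFrame({
    "P": [10, 20, 30, 40, 50, 60],
    "K": [25, 45, 65, 85, 105, 125],
    "ph": [6.0, 7.0, 6.0, 7.0, 6.0, 7.0],
})

corr = df.corr()  # Pearson correlation by default

# Keep only the upper triangle so each pair is listed once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

threshold = 0.7  # arbitrary cut-off for "strong" (assumption)
strong_pairs = pairs[pairs.abs() > threshold]
print(strong_pairs)
```

With seaborn installed, `sns.heatmap(corr, annot=True)` would draw the heatmap for the same matrix.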

## 5. Model Training
Choose a machine learning model suitable for the dataset and train it. You may start with basic models like Decision Trees, Logistic Regression, or SVM.

### Task 11:
- Drop columns that are not relevant for the model (N_Level).
- Split the dataset into training and testing sets.
- Train a model and evaluate its performance.
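A minimal sketch of these steps using a Decision Tree from scikit-learn. The data is made up and trivially separable, and `label` is an assumed name for the target column, so the resulting score says nothing about the real dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up data; "label" is a hypothetical name for the crop-type column
crop_data = pd.DataFrame({
    "N": [90, 85, 88, 92, 20, 25, 22, 18, 91, 19],
    "rainfall": [200, 210, 220, 230, 60, 70, 65, 75, 215, 68],
    "N_Level": ["high"] * 4 + ["low"] * 4 + ["high", "low"],
    "label": ["rice"] * 4 + ["maize"] * 4 + ["rice", "maize"],
})

# Drop the binned helper column and separate features from the target
X = crop_data.drop(columns=["N_Level", "label"])
y = crop_data["label"]

# Hold out 20% of the rows for testing, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

`N_Level` is derived from `N`, so it adds no new information, and as a text column it would also need encoding before a scikit-learn model could use it.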


## 6. Model Evaluation
Evaluate the trained model using suitable metrics such as accuracy, precision, recall, and F1 score.



### Task 11:
- Make predictions using the test dataset.
- Calculate evaluation metrics.
- Plot confusion matrix and analyze the results.
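The evaluation steps can be sketched with `sklearn.metrics`. The true and predicted labels below are made up by hand so the numbers are easy to verify; in the assignment they would come from the test set and `model.predict`.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hand-made labels (assumption for illustration): one rice sample is
# misclassified as jute
y_test = ["rice", "rice", "rice", "maize", "maize", "jute"]
y_pred = ["rice", "rice", "jute", "maize", "maize", "jute"]

labels = ["jute", "maize", "rice"]  # fixes the row/column order of the matrix

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=labels)
report = classification_report(y_test, y_pred, labels=labels, zero_division=0)

print(accuracy)
print(cm)
print(report)
```

The report lists precision, recall, and F1 score per class. With seaborn installed, `sns.heatmap(cm, annot=True)` is one way to plot the matrix.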


### Task 12:

The confusion matrix and classification report provide detailed insights into the model's performance:

**Confusion Matrix:**

The confusion matrix shows the number of correct and incorrect predictions made by the model for each crop type. Each row represents the actual class, and each column represents the predicted class.

- Diagonal values indicate true positives (TP), where the model correctly predicted the crop type.
- Off-diagonal values represent misclassifications: each one counts as a false negative (FN) for the actual class (its row) and a false positive (FP) for the predicted class (its column).
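To make the row and column reading concrete, the sketch below derives per-class TP, FP, and FN counts from a small hand-made confusion matrix (rows are actual classes, columns are predicted classes). The matrix values are made up for illustration.

```python
import numpy as np

labels = ["jute", "maize", "rice"]

# Hand-made matrix (assumption): one actual "rice" sample predicted as "jute"
cm = np.array([
    [1, 0, 0],   # actual jute
    [0, 2, 0],   # actual maize
    [1, 0, 2],   # actual rice
])

tp = np.diag(cm)              # correct predictions per class
fp = cm.sum(axis=0) - tp      # column total minus diagonal
fn = cm.sum(axis=1) - tp      # row total minus diagonal

precision = tp / (tp + fp)
recall = tp / (tp + fn)

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```

The single misclassified sample appears once in the matrix but affects two classes: it lowers the recall of rice and the precision of jute.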

## Congrats! You have now completed this week's assignment