# Data Ingestion

This notebook loads the raw mandi crop price dataset from the raw data
directory and performs basic inspection to understand its structure,
size, and integrity. No data cleaning or transformation is performed
in this notebook.


In [3]:
import pandas as pd
import numpy as np

In [4]:
path = "../data/raw/raw_data.csv"
df = pd.read_csv(path)

In [5]:
df.shape

(1118899, 11)

### Dataset Size Observation

The dataset contains 1,118,899 rows and 11 columns. Each row records
the price of a specific crop, identified by its commodity, variety, and
grade, sold in a particular mandi (market) on a given date. At this
scale, the dataset reflects real-world mandi price behavior across many
locations and dates.
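The coverage described above can be checked by counting distinct values per dimension. The following is a minimal sketch on a few illustrative rows standing in for `df`; the helper name `coverage_summary` is hypothetical and not part of the notebook.

```python
import pandas as pd


def coverage_summary(frame: pd.DataFrame, columns: list[str]) -> pd.Series:
    """Return the number of distinct values in each of the given columns."""
    return frame[columns].nunique()


# Illustrative rows only; in the notebook, pass the loaded `df` instead.
sample = pd.DataFrame({
    "State": ["Uttar Pradesh", "Uttar Pradesh", "Gujarat"],
    "Market Name": ["Achalda", "Agra", "Vyra"],
    "Commodity": ["Wheat", "Wheat", "Lentil (Masur)(Whole)"],
})

summary = coverage_summary(sample, ["State", "Market Name", "Commodity"])
print(summary)
```

On the full dataset, the same call with `df` would report how many states, markets, and commodities the 1,118,899 rows actually cover.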


In [6]:
df.columns

Index(['Sl no.', 'District Name', 'Market Name', 'Commodity', 'Variety',
       'Grade', 'Min Price (Rs./Quintal)', 'Max Price (Rs./Quintal)',
       'Modal Price (Rs./Quintal)', 'Price Date', 'State'],
      dtype='object')

# First 5 rows


In [7]:
df.head()

| | Sl no. | District Name | Market Name | Commodity | Variety | Grade | Min Price (Rs./Quintal) | Max Price (Rs./Quintal) | Modal Price (Rs./Quintal) | Price Date | State |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Auraiya | Achalda | Wheat | Dara | FAQ | 2350.0 | 2550.0 | 2450.0 | 05 Apr 2025 | Uttar Pradesh |
| 1 | 2 | Auraiya | Achalda | Wheat | Dara | FAQ | 2400.0 | 2500.0 | 2470.0 | 14 Jun 2025 | Uttar Pradesh |
| 2 | 3 | Auraiya | Achalda | Wheat | Dara | FAQ | 2400.0 | 2500.0 | 2470.0 | 23 Jun 2025 | Uttar Pradesh |
| 3 | 4 | Auraiya | Achalda | Wheat | Dara | FAQ | 2400.0 | 2520.0 | 2470.0 | 26 Jun 2025 | Uttar Pradesh |
| 4 | 5 | Auraiya | Achalda | Wheat | Dara | FAQ | 2400.0 | 2550.0 | 2500.0 | 03 Jun 2025 | Uttar Pradesh |


In [16]:
df['Market Name'].unique()

array(['Achalda', 'Achnera', 'Agra', ..., 'Valia(Nethrang)', 'Ipur',
       'English Bazar'], dtype=object)

# Last 5 rows

In [6]:
df.tail()

| | Sl no. | District Name | Market Name | Commodity | Variety | Grade | Min Price (Rs./Quintal) | Max Price (Rs./Quintal) | Modal Price (Rs./Quintal) | Price Date | State |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1118894 | 17380 | Vidisha | Vidisha | Lentil (Masur)(Whole) | Organic | FAQ | 5200.0 | 5200.0 | 5200.0 | 20 Nov 2024 | Madhya Pradesh |
| 1118895 | 17381 | Vidisha | Vidisha | Lentil (Masur)(Whole) | Organic | FAQ | 5412.0 | 5412.0 | 5412.0 | 27 Feb 2025 | Madhya Pradesh |
| 1118896 | 17382 | Vidisha | Vidisha | Lentil (Masur)(Whole) | Organic | FAQ | 5600.0 | 5600.0 | 5600.0 | 08 Jan 2025 | Madhya Pradesh |
| 1118897 | 17383 | Vidisha | Vidisha | Lentil (Masur)(Whole) | Organic | FAQ | 5640.0 | 5720.0 | 5720.0 | 18 Jan 2025 | Madhya Pradesh |
| 1118898 | 1 | Surat | Vyra | Lentil (Masur)(Whole) | Other | FAQ | 6000.0 | 7000.0 | 6500.0 | 10 Oct 2024 | Gujarat |


# Data information 

In [7]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1118899 entries, 0 to 1118898
Data columns (total 11 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   Sl no.                     1118899 non-null  int64  
 1   District Name              1118899 non-null  object 
 2   Market Name                1118899 non-null  object 
 3   Commodity                  1118899 non-null  object 
 4   Variety                    1118899 non-null  object 
 5   Grade                      1118899 non-null  object 
 6   Min Price (Rs./Quintal)    1118899 non-null  float64
 7   Max Price (Rs./Quintal)    1118899 non-null  float64
 8   Modal Price (Rs./Quintal)  1118899 non-null  float64
 9   Price Date                 1118899 non-null  object 
 10  State                      1118899 non-null  object 
dtypes: float64(3), int64(1), object(7)
memory usage: 93.9+ MB


In [8]:
df.isnull().sum()

Sl no.                       0
District Name                0
Market Name                  0
Commodity                    0
Variety                      0
Grade                        0
Min Price (Rs./Quintal)      0
Max Price (Rs./Quintal)      0
Modal Price (Rs./Quintal)    0
Price Date                   0
State                        0
dtype: int64

**No missing values in the dataset.**
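`isnull()` counts only `NaN`/`None` entries, so empty or whitespace-only strings in text columns would not show up in the totals above. A minimal sketch of a complementary check follows; the helper name `count_blank_strings` and the sample rows are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def _is_blank(value) -> bool:
    """True only for strings that are empty or contain just whitespace."""
    return isinstance(value, str) and value.strip() == ""


def count_blank_strings(frame: pd.DataFrame) -> pd.Series:
    """Count blank strings per column; non-string values never count."""
    return frame.apply(lambda col: col.map(_is_blank).sum())


# Illustrative rows only; in the notebook, pass the loaded `df` instead.
sample = pd.DataFrame({
    "Grade": ["FAQ", " ", ""],
    "State": ["Gujarat", None, "Gujarat"],
    "Modal Price (Rs./Quintal)": [6500.0, 5200.0, 5412.0],
})

blanks = count_blank_strings(sample)
print(blanks)
```

In this sample, `Grade` has no nulls according to `isnull()` yet contains two blank entries, which is the case this check is meant to catch.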

In [9]:
df.duplicated().sum()

np.int64(0)

**No duplicate rows in the dataset.**
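`df.duplicated()` compares every column, including `Sl no.`. If `Sl no.` is a serial number assigned per source file (an assumption, suggested by the value restarting at 1 in the last row of `df.tail()`), two otherwise identical records with different serial numbers would not be flagged. A minimal sketch of a check that ignores chosen columns; the helper name `count_duplicates_ignoring` and the sample rows are hypothetical.

```python
import pandas as pd


def count_duplicates_ignoring(frame: pd.DataFrame, ignore: list[str]) -> int:
    """Count rows that repeat an earlier row on all columns except `ignore`."""
    subset = [col for col in frame.columns if col not in ignore]
    return int(frame.duplicated(subset=subset).sum())


# Illustrative rows only: rows 0 and 1 differ solely in the serial number.
sample = pd.DataFrame({
    "Sl no.": [1, 2, 3],
    "Market Name": ["Achalda", "Achalda", "Agra"],
    "Commodity": ["Wheat", "Wheat", "Wheat"],
    "Price Date": ["05 Apr 2025", "05 Apr 2025", "05 Apr 2025"],
})

full_row_dupes = int(sample.duplicated().sum())
dupes_without_serial = count_duplicates_ignoring(sample, ["Sl no."])
print(full_row_dupes, dupes_without_serial)
```

Whether such records count as true duplicates is a decision for the cleaning stage; this check only surfaces them.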

In [10]:
df["Price Date"].head()

0    05 Apr 2025
1    14 Jun 2025
2    23 Jun 2025
3    26 Jun 2025
4    03 Jun 2025
Name: Price Date, dtype: object

In [11]:
df["Price Date"].tail()

1118894    20 Nov 2024
1118895    27 Feb 2025
1118896    08 Jan 2025
1118897    18 Jan 2025
1118898    10 Oct 2024
Name: Price Date, dtype: object

In [12]:
df["Price Date"].sample(10)

977339    12 Feb 2025
511279    29 Apr 2025
756414    02 May 2025
461184    25 Apr 2025
669902    07 May 2025
286978    23 Dec 2024
156253    23 Jun 2025
726933    18 Jul 2025
685698    19 Mar 2025
534015    24 Jun 2025
Name: Price Date, dtype: object

In [13]:
# Parse the date strings; errors="coerce" turns unparseable values into NaT
dates = pd.to_datetime(df["Price Date"], errors="coerce")


In [14]:
oldest_date = dates.min()
newest_date = dates.max()

oldest_date, newest_date


(Timestamp('2024-08-15 00:00:00'), Timestamp('2025-08-14 00:00:00'))

In [15]:
dates.isna().sum()


np.int64(0)

In [16]:
# Save ingested (unchanged) data for cleaning stage
interim_path = "../data/interim/ingested_data.csv"
df.to_csv(interim_path, index=False)
print(f"Ingested data saved to {interim_path}")

Ingested data saved to ../data/interim/ingested_data.csv


### Time Span Observation

The mandi price dataset spans from 2024-08-15 to 2025-08-14, covering
one year of price records. This range is enough to observe short-term
trends and a single cycle of seasonal price variation across mandis,
but it does not allow year-over-year comparison. All `Price Date`
values parsed as valid dates, no duplicate rows were found, and no null
values were observed in any column. The dataset appears structurally
complete and suitable for further data cleaning and time-based
analysis.
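The span and its coverage can be quantified directly from the parsed dates. The sketch below uses a handful of date strings copied from the outputs above in place of the full column; the explicit `format="%d %b %Y"` is an assumption based on the samples shown, and it relies on English month abbreviations.

```python
import pandas as pd

# A few date strings copied from the notebook outputs; use df["Price Date"]
# in the notebook itself.
sample_dates = pd.Series(
    ["15 Aug 2024", "23 Dec 2024", "02 May 2025", "07 May 2025", "14 Aug 2025"]
)

# Strict parsing: anything not matching "DD Mon YYYY" becomes NaT.
parsed = pd.to_datetime(sample_dates, format="%d %b %Y", errors="coerce")

# Length of the covered period in days.
span_days = (parsed.max() - parsed.min()).days

# Number of records per calendar month, in chronological order.
rows_per_month = parsed.dt.to_period("M").value_counts().sort_index()

print(span_days)
print(rows_per_month)
```

Run on the full column, months with unusually low counts would point to gaps in coverage worth noting before any time-based analysis.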
