This notebook draws a 500k-row sample of English-language questions from the Producers Direct Farmers dataset, then uses an NLP semantic-similarity model to categorize the questions for Challenge 2. The output is a CSV of the sampled rows with predicted category labels, ready for analysis. Code near the bottom builds a profile report for quick EDA.
An analysis of the classified questions follows at the bottom.

In [2]:
import pandas as pd
from ydata_profiling import ProfileReport  # used by the EDA report cells

from sentence_transformers import SentenceTransformer, util  # used by the classification cell
import torch  # used by the classification cell

In [None]:
# Load the Parquet file
df = pd.read_parquet('/kaggle/input/producersdirectdata-parquet')

# Filter rows where question_language is 'eng'
filtered_df = df[df['question_language'] == 'eng']

filtered_df = filtered_df.reset_index(drop=True)

In [None]:
#sampling random 500k rows for speed

sampled_df = filtered_df.sample(n=500000, random_state=42)

#sampled_df

In [None]:
## identify unique question types & develop summary categories

#unique_question_topic = sampled_df['question_topic'].unique().tolist()

#print(unique_question_topic)

In [None]:
## the topics span a wide range (fruits, vegetables, animals, insects, flowers), so use an NLP approach to categorize them automatically

## semantic similarity model - test on a small subset before running fully
# testing on a small subset and re-randomizing lets me create new categories based on the responses
# split the 500k sample into 10 unique batches for quicker inference with the semantic model

# drop na's
#sampled_df = sampled_df.dropna(subset=['question_topic'])
sampled_df = sampled_df.dropna(subset=['question_content'])

# Define categories
categories = ["livestock", "harvesting", "planting", "pests", "markets", "fruits", "vegetables", "seeds", "nuts", "weather", "equipment",
             "soil", "vaccines", "raising livestock"]

# Load model
model = SentenceTransformer('/kaggle/input/all-minilm-l6-v2')

# Encode categories once
category_embeddings = model.encode(categories, convert_to_tensor=True)

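# Split into 10 disjoint random chunks (about 50k rows each).
# The dropna above may leave fewer than 500k rows, so size the chunks from the data
# instead of sampling a fixed 50k ten times, which fails once too few rows remain.
n_chunks = 10
shuffled_df = sampled_df.sample(frac=1, random_state=42)
chunk_size = max(1, -(-len(shuffled_df) // n_chunks))  # ceiling division

chunks = [shuffled_df.iloc[start:start + chunk_size].copy()
          for start in range(0, len(shuffled_df), chunk_size)]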

# Function to classify a batch
def classify_batch(df_chunk):
    question_embeddings = model.encode(df_chunk['question_content'].tolist(), convert_to_tensor=True, batch_size=32)
    similarities = util.cos_sim(question_embeddings, category_embeddings)
    df_chunk['predicted_category'] = [categories[torch.argmax(sim).item()] for sim in similarities]
    return df_chunk

# Process each chunk
classified_chunks = [classify_batch(chunk) for chunk in chunks]

# Combine results
final_df = pd.concat(classified_chunks, ignore_index=True)
final_df.to_csv('classified_questions_output.csv', index=False)
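
The classification step above assigns each question to the category whose embedding has the highest cosine similarity. A minimal sketch of that idea in plain NumPy is shown below. The 3-dimensional vectors are made-up stand-ins for model embeddings, and `nearest_category` is a hypothetical helper, not part of sentence_transformers.

```python
import numpy as np

def nearest_category(question_vecs, category_vecs, labels):
    """Return the best-matching label and its cosine score for each question vector."""
    # Normalise rows to unit length so the dot product equals cosine similarity
    q = question_vecs / np.linalg.norm(question_vecs, axis=1, keepdims=True)
    c = category_vecs / np.linalg.norm(category_vecs, axis=1, keepdims=True)
    sims = q @ c.T                      # shape: (n_questions, n_categories)
    best = sims.argmax(axis=1)          # index of the most similar category per question
    return [labels[i] for i in best], sims.max(axis=1)

# Toy data: these vectors are invented for illustration, not real embeddings
toy_labels = ["livestock", "planting", "pests"]
toy_category_vecs = np.array([[1.0, 0.0, 0.0],
                              [0.0, 1.0, 0.0],
                              [0.0, 0.0, 1.0]])
toy_question_vecs = np.array([[0.9, 0.1, 0.0],
                              [0.2, 0.7, 0.1],
                              [0.1, 0.2, 0.3]])

predicted, scores = nearest_category(toy_question_vecs, toy_category_vecs, toy_labels)
print(predicted)
```

Returning the score alongside the label makes it possible to inspect low-confidence matches, which may be worth doing because every question is forced into one of the listed categories.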


In [None]:
# reviewing output before full run
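# predicted_category is only added to the classified chunks, so review final_df rather than sampled_df
review_df = final_df[['question_topic', 'question_content', 'predicted_category', 'response_content']]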

review_df.tail(25)

In [None]:
## code to build the ydata-profiling report
profile = ProfileReport(sampled_df, title="EDA Report", explorative=True)
profile.to_notebook_iframe()  # If you're in a Jupyter notebook

In [None]:
profile.to_file("eda_report_producers_direct_farmers.html")

In [6]:
##analysis below of the classified questions

df = pd.read_csv('/kaggle/input/classified-questions-full/classified_questions_output.csv')

In [7]:
df.head(15)

Unnamed: 0,question_id,question_user_id,question_language,question_content,question_topic,question_sent,response_id,response_user_id,response_language,response_content,...,question_user_gender,question_user_dob,question_user_created_at,response_user_type,response_user_status,response_user_country_code,response_user_gender,response_user_dob,response_user_created_at,predicted_category
0,25880162,1913528,eng,What is weeding.,,2019-05-22 04:01:52.224296+00,25881168,563347,eng,Q1704 Weeding Ir The Removing Of Unwanted Plan...,...,,,2019-02-25 07:50:08.739753+00,farmer,live,ug,,,2017-12-04 12:46:51+00,planting
1,24592875,1488760,eng,"Q,my Calf Is Very Weak It Does Not Feed Well...",cattle,2019-05-01 10:39:12.058657+00,24593425,1975032,swa,Vitamin,...,,,2018-11-16 07:05:03.975227+00,farmer,live,ke,,,2019-03-21 17:34:00.423062+00,livestock
2,7429361,848131,eng,The method of farming in growing rice,rice,2018-07-05 18:25:05.058956+00,7466822,635518,eng,Q123 MONOCROPING ..,...,,,2018-05-12 13:07:47.337269+00,farmer,destroyed,ug,,,2018-01-22 04:11:53+00,planting
3,12568368,77169,eng,Q. I have 2 live tock 1 for hens and 1 for goa...,chicken,2018-10-03 08:51:32.491328+00,12568651,97846,eng,Q542 buy hay and commacial feeds,...,,,2016-06-09 05:18:04+00,farmer,live,ke,male,1984-02-15,2016-09-20 04:57:57+00,raising livestock
4,7959109,911346,eng,What Is The Best Way Of Making A Silage And Su...,,2018-07-26 14:22:11.890114+00,7959191,1017884,eng,Q1 by drying the matter--DENNIS,...,,,2018-06-14 05:22:58.732969+00,farmer,zombie,ke,,,2018-07-26 14:16:56.325087+00,planting
5,8609983,705702,eng,How will know that the cow it is on heat?,cattle,2018-08-11 19:37:10.882334+00,8610068,1028828,eng,Q163 If the cow is climbing on male cattle,...,,,2018-03-02 17:41:11.089571+00,farmer,zombie,ke,,,2018-08-02 04:43:02.401549+00,livestock
6,43240546,2516140,eng,A Q if the tree fell on my crops while being ...,tree,2020-06-11 15:06:30.732679+00,43240960,2991207,eng,The owner of the power saw he should tie the t...,...,,,2019-09-25 16:44:14.199926+00,farmer,live,ke,,,2020-03-26 17:35:17.926767+00,planting
7,58527736,3602110,eng,What is akofu in english,,2021-08-22 16:30:53.492357+00,58528114,3362845,eng,Q306 i dont know,...,,,2021-03-19 06:37:00.563176+00,farmer,live,ug,,,2020-10-07 13:50:24.003525+00,livestock
8,17820594,21487,eng,Q.I'm planting my tomatoes under irrigation;wh...,tomato,2018-12-03 14:03:28.856492+00,17820901,1210341,eng,q36pangozep,...,male,1986-11-03,2015-09-17 15:48:15+00,farmer,live,ke,,,2018-09-16 19:47:46.61179+00,vegetables
9,44028077,2728610,eng,Qn what are some of diseases which attack ban...,banana,2020-06-29 19:39:11.341461+00,44028362,226486,lug,Q245 biwuka,...,,,2019-12-07 05:44:27.636529+00,farmer,blocked,ug,female,1978-05-25,2017-06-15 07:08:34+00,fruits


In [9]:
df.dtypes

question_id                    int64
question_user_id               int64
question_language             object
question_content              object
question_topic                object
question_sent                 object
response_id                    int64
response_user_id               int64
response_language             object
response_content              object
response_topic                object
response_sent                 object
question_user_type            object
question_user_status          object
question_user_country_code    object
question_user_gender          object
question_user_dob             object
question_user_created_at      object
response_user_type            object
response_user_status          object
response_user_country_code    object
response_user_gender          object
response_user_dob             object
response_user_created_at      object
predicted_category            object
dtype: object

In [None]:
## group questions by season and topic - count occurrences of each question
# need to create a season column
# define seasons by country
# Kenya - long rains = March > May (also main planting), short rains = October > December (also secondary planting)
## harvest periods: June > August, January > February
# Uganda - cropping - season A - planting = March > May, harvesting = June > August
# season B - planting = September > November, harvesting = December > February
# Tanzania - Masika rains = March > May, Vuli rains = October > December
# planting - Masika planting March, Vuli planting October
# harvesting June > August

# question_sent > create month from date > column is stored as object, so convert to datetime, then extract month as word

In [19]:
#get month from date
df['question_sent'] = pd.to_datetime(df['question_sent'], utc=True, format='ISO8601')

df['month'] = df['question_sent'].dt.strftime('%B')

#df.head(15)

In [25]:
# season mapping
#currently only kenya and uganda - tanzania not in this data set


season_matrix = {
    ('ke', 'January'): 'Harvesting',
    ('ke', 'February'): 'Harvesting',
    ('ke', 'March'): 'Long Rains Planting',
    ('ke', 'April'): 'Long Rains Planting',
    ('ke', 'May'): 'Long Rains Planting',
    ('ke', 'June'): 'Harvesting',
    ('ke', 'July'): 'Harvesting',
    ('ke', 'August'): 'Harvesting',
    ('ke', 'September'): 'Harvesting',
    ('ke', 'October'): 'Short Rains Secondary Planting',
    ('ke', 'November'): 'Short Rains Secondary Planting',
    ('ke', 'December'): 'Short Rains Secondary Planting',
    ('ug', 'January'): 'Season B Harvesting',
    ('ug', 'February'): 'Season B Harvesting',
    ('ug', 'March'): 'Season A Planting',
    ('ug', 'April'): 'Season A Planting',
    ('ug', 'May'): 'Season A Planting',
    ('ug', 'June'): 'Season A Harvesting',
    ('ug', 'July'): 'Season A Harvesting',
    ('ug', 'August'): 'Season A Harvesting',
    ('ug', 'September'): 'Season B Planting',
    ('ug', 'October'): 'Season B Planting',
    ('ug', 'November'): 'Season B Planting',
    ('ug', 'December'): 'Season B Harvesting',

}

df['Season'] = df.apply(lambda row: season_matrix.get((row['question_user_country_code'], row['month']), ''), axis=1)

df.head(15)


Unnamed: 0,question_id,question_user_id,question_language,question_content,question_topic,question_sent,response_id,response_user_id,response_language,response_content,...,question_user_created_at,response_user_type,response_user_status,response_user_country_code,response_user_gender,response_user_dob,response_user_created_at,predicted_category,month,Season
0,25880162,1913528,eng,What is weeding.,,2019-05-22 04:01:52.224296+00:00,25881168,563347,eng,Q1704 Weeding Ir The Removing Of Unwanted Plan...,...,2019-02-25 07:50:08.739753+00,farmer,live,ug,,,2017-12-04 12:46:51+00,planting,May,Season A Planting
1,24592875,1488760,eng,"Q,my Calf Is Very Weak It Does Not Feed Well...",cattle,2019-05-01 10:39:12.058657+00:00,24593425,1975032,swa,Vitamin,...,2018-11-16 07:05:03.975227+00,farmer,live,ke,,,2019-03-21 17:34:00.423062+00,livestock,May,Long Rains Planting
2,7429361,848131,eng,The method of farming in growing rice,rice,2018-07-05 18:25:05.058956+00:00,7466822,635518,eng,Q123 MONOCROPING ..,...,2018-05-12 13:07:47.337269+00,farmer,destroyed,ug,,,2018-01-22 04:11:53+00,planting,July,Season A Harvesting
3,12568368,77169,eng,Q. I have 2 live tock 1 for hens and 1 for goa...,chicken,2018-10-03 08:51:32.491328+00:00,12568651,97846,eng,Q542 buy hay and commacial feeds,...,2016-06-09 05:18:04+00,farmer,live,ke,male,1984-02-15,2016-09-20 04:57:57+00,raising livestock,October,Short Rains Secondary Planting
4,7959109,911346,eng,What Is The Best Way Of Making A Silage And Su...,,2018-07-26 14:22:11.890114+00:00,7959191,1017884,eng,Q1 by drying the matter--DENNIS,...,2018-06-14 05:22:58.732969+00,farmer,zombie,ke,,,2018-07-26 14:16:56.325087+00,planting,July,Harvesting
5,8609983,705702,eng,How will know that the cow it is on heat?,cattle,2018-08-11 19:37:10.882334+00:00,8610068,1028828,eng,Q163 If the cow is climbing on male cattle,...,2018-03-02 17:41:11.089571+00,farmer,zombie,ke,,,2018-08-02 04:43:02.401549+00,livestock,August,Harvesting
6,43240546,2516140,eng,A Q if the tree fell on my crops while being ...,tree,2020-06-11 15:06:30.732679+00:00,43240960,2991207,eng,The owner of the power saw he should tie the t...,...,2019-09-25 16:44:14.199926+00,farmer,live,ke,,,2020-03-26 17:35:17.926767+00,planting,June,Harvesting
7,58527736,3602110,eng,What is akofu in english,,2021-08-22 16:30:53.492357+00:00,58528114,3362845,eng,Q306 i dont know,...,2021-03-19 06:37:00.563176+00,farmer,live,ug,,,2020-10-07 13:50:24.003525+00,livestock,August,Season A Harvesting
8,17820594,21487,eng,Q.I'm planting my tomatoes under irrigation;wh...,tomato,2018-12-03 14:03:28.856492+00:00,17820901,1210341,eng,q36pangozep,...,2015-09-17 15:48:15+00,farmer,live,ke,,,2018-09-16 19:47:46.61179+00,vegetables,December,Short Rains Secondary Planting
9,44028077,2728610,eng,Qn what are some of diseases which attack ban...,banana,2020-06-29 19:39:11.341461+00:00,44028362,226486,lug,Q245 biwuka,...,2019-12-07 05:44:27.636529+00,farmer,blocked,ug,female,1978-05-25,2017-06-15 07:08:34+00,fruits,June,Season A Harvesting
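
The row-wise `df.apply` lookup above can be slow on several hundred thousand rows. One possible alternative, sketched here on a tiny made-up frame, is to turn the season dictionary into a lookup table and left-merge it. The names `season_matrix_demo`, `lookup`, and `demo` are illustrative only, and the dictionary is a three-entry subset of the full mapping.

```python
import pandas as pd

# Small subset of the (country, month) -> season mapping, for illustration only
season_matrix_demo = {
    ('ke', 'March'): 'Long Rains Planting',
    ('ke', 'June'): 'Harvesting',
    ('ug', 'March'): 'Season A Planting',
}

# Flatten the dictionary into a three-column lookup table
lookup = pd.DataFrame(
    [(country, month, season) for (country, month), season in season_matrix_demo.items()],
    columns=['question_user_country_code', 'month', 'Season'],
)

# Made-up rows; 'tz' has no entry in the mapping
demo = pd.DataFrame({
    'question_user_country_code': ['ke', 'ug', 'ke', 'tz'],
    'month': ['March', 'March', 'June', 'March'],
})

# A left merge keeps every row of demo in its original order
demo_with_season = demo.merge(
    lookup, on=['question_user_country_code', 'month'], how='left'
)
# Unmatched country/month pairs get '' to mirror the .get(..., '') default
demo_with_season['Season'] = demo_with_season['Season'].fillna('')
print(demo_with_season)
```

Note that `merge` returns a new frame with a fresh integer index, so this sketch assumes the original index does not need to be preserved.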


In [30]:
season_df = (
    df.groupby(['month', 'Season'])['predicted_category']
      .value_counts()
      .reset_index(name='count_of_question_cat')
)

season_df


Unnamed: 0,month,Season,predicted_category,count_of_question_cat
0,April,Long Rains Planting,planting,8959
1,April,Long Rains Planting,livestock,5461
2,April,Long Rains Planting,vegetables,2797
3,April,Long Rains Planting,raising livestock,1882
4,April,Long Rains Planting,seeds,1711
...,...,...,...,...
347,September,Season B Planting,harvesting,622
348,September,Season B Planting,vaccines,522
349,September,Season B Planting,equipment,262
350,September,Season B Planting,nuts,214
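
The grouped output above is sorted alphabetically by month name, so April comes first. If calendar order is preferred, one option is to store the month as an ordered categorical before grouping. The sketch below uses a made-up frame, and `MONTH_ORDER` and `demo` are hypothetical names.

```python
import pandas as pd

# English month names in calendar order (hard-coded to avoid locale dependence)
MONTH_ORDER = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

# Made-up counts for illustration
demo = pd.DataFrame({
    'month': ['September', 'April', 'January', 'April'],
    'count': [3, 5, 2, 1],
})

# An ordered categorical makes sorting and groupby follow calendar order
demo['month'] = pd.Categorical(demo['month'], categories=MONTH_ORDER, ordered=True)

# observed=True keeps only the months that actually appear in the data
by_month = demo.groupby('month', observed=True)['count'].sum()
print(by_month)
```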


In [34]:
#plotting the seasonal data
import plotly.express as px
fig = px.bar(season_df,
             x='Season',
             y='count_of_question_cat',
             color='predicted_category',
             barmode='group',
             title='Number of Questions Asked by Season',
             labels={'Season': 'Season', 'count_of_question_cat': 'Count', 'predicted_category': 'Category'})
fig.show()

ValueError: 
Image export using the "kaleido" engine requires the kaleido package,
which can be installed using pip:
    $ pip install -U kaleido
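
Because `season_df` has one row per month, season, and category, a season that spans several months contributes several rows per category to the chart. A possible pre-aggregation step, sketched on made-up numbers, sums the counts to one row per season and category before plotting. The frame `demo_season_df` and the name `season_totals` are illustrative only.

```python
import pandas as pd

# Made-up rows shaped like season_df: two months fall in the same season
demo_season_df = pd.DataFrame({
    'month': ['March', 'April', 'March'],
    'Season': ['Long Rains Planting', 'Long Rains Planting', 'Season A Planting'],
    'predicted_category': ['planting', 'planting', 'planting'],
    'count_of_question_cat': [10, 20, 5],
})

# Collapse the month dimension so each (Season, category) pair has a single total
season_totals = (
    demo_season_df
    .groupby(['Season', 'predicted_category'], as_index=False)['count_of_question_cat']
    .sum()
)
print(season_totals)
```

The resulting frame could then be passed to `px.bar` in place of `season_df`, with the same column names.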
