### Overview: 
* This notebook reads Kenyan farmers' questions and cleans the English-language ones in preparation for tokenization and further NLP


### Input (in working directory): 
* CSV file of questions by **live** and **zombie** users:  'questions_kenya_df.csv' 
* Note:  *question_preprocess.ipynb* notebook creates the input file


### Output (to working directory):  
* CSV file of 'cleaned' questions: 'kenya_eng_q_clean.csv'
* This creates the input for the *nlp_eng_q_{topic}.ipynb* notebooks to create visualizations based on question topic


### Steps:
1. Explore the data at a top level
2. Select English questions
3. Drop duplicate questions
4. Convert questions to lowercase
5. Remove the prefix 'q' from questions
6. Remove numbers
7. Replace punctuation marks with ' '
8. Drop any duplicate questions again
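
The cleaning steps listed above can be condensed into a single function. The following is a minimal sketch on toy data, not the notebook's actual code: the helper name `clean_questions` and the sample rows are hypothetical, and the anchored pattern `^q\b` is an assumption about what "remove prefix 'q'" is meant to do.

```python
import pandas as pd

# Toy rows standing in for the real question file (illustrative values only)
toy_df = pd.DataFrame({
    "question_id": [1, 2, 3, 4],
    "question_language": ["eng", "swa", "eng", "eng"],
    "question_content": [
        "Q.What is the iron of hens?",
        "S dawa ya viroboto kwa kuku",
        "q.what is the iron of hens!",
        "How to keep 20 quails",
    ],
})

def clean_questions(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: select English rows, clean the text, drop duplicates."""
    out = df.loc[df["question_language"] == "eng"].copy()
    text = out["question_content"].str.lower()
    text = text.str.replace(r"^q\b", "", regex=True)       # leading 'q' marker only
    text = text.str.replace(r"\d+", "", regex=True)        # remove numbers
    text = text.str.replace(r"[^\w\s]", " ", regex=True)   # punctuation -> space
    out["question_clean"] = text
    return out.drop_duplicates(subset=["question_clean"], keep="first")

cleaned = clean_questions(toy_df)
print(cleaned[["question_id", "question_clean"]])
```

Rows 1 and 3 differ only in case and punctuation, so they collapse into one row after cleaning; the Swahili row is filtered out before any text processing.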


#import packages:  pandas, numpy, fastparquet (for saving large data files), nltk
import pandas as pd
import numpy as np
import fastparquet as fp

import nltk


In [3]:
#load processed country question file
kenya_df = pd.read_csv('questions_kenya_df.csv')
kenya_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2128052 entries, 0 to 2128051
Data columns (total 9 columns):
 #   Column                      Dtype 
---  ------                      ----- 
 0   question_id                 int64 
 1   question_user_id            int64 
 2   question_language           object
 3   question_content            object
 4   question_topic              object
 5   question_user_status        object
 6   question_user_country_code  object
 7   question_sent_date          object
 8   response_count              int64 
dtypes: int64(3), object(6)
memory usage: 146.1+ MB


In [6]:
#see initial records
print(kenya_df.head())

    question_id  question_user_id question_language  \
4       3849082            417525               swa   
6       3849096            417525               swa   
9       3849117            524698               swa   
10      3849129             54426               eng   
11      3849143            403213               swa   

                                     question_content question_topic  \
4                         S dawa ya viroboto.kwa kuku        poultry   
6                         S dawa.ya.viroboto.kwa.kuku        poultry   
9   Q:niko Na Punda,,anakohoa Ni Dawa Gany Naexa M...           None   
10                         Q#.Which plant has omega3?          plant   
11  S Ng'ombe aina kani itoayo maziwa 20 lita kwa ...         cattle   

   question_user_status question_user_country_code        question_sent_date  \
4                  live                         ke 2017-11-22 12:25:10+00:00   
6                  live                         ke 2017-11-22 12:25:12+00:00

In [4]:
#count questions by language and by topic, questions with no topic, and unique user ids
print("the count by language: ", kenya_df['question_language'].value_counts())
print("the count by topic: ", kenya_df['question_topic'].value_counts())
print("the number of questions with no topic: ",kenya_df['question_topic'].isna().sum())
print("the number of unique user ids: ", kenya_df['question_user_id'].nunique())

the count by language:  question_language
eng    1502530
swa     625522
Name: count, dtype: int64
the count by topic:  question_topic
cattle          255944
chicken         233918
maize           165513
tomato           90407
poultry          88464
                 ...  
leucaena            12
blackberry          11
rye                 10
purple-vetch         7
cranberry            1
Name: count, Length: 148, dtype: int64
the number of questions with no topic:  489890
the number of unique user ids:  323267
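
The topic counts and the count of questions without a topic are computed separately above, because `value_counts()` skips missing values by default. As a small sketch on made-up data (the series below is illustrative, not taken from the question file), passing `dropna=False` reports both in one table:

```python
import pandas as pd

# Illustrative topics; None marks a question with no topic assigned
topics = pd.Series(
    ["cattle", "chicken", None, "cattle", None, None],
    name="question_topic",
)

counts = topics.value_counts(dropna=False)  # missing values get their own row
missing = topics.isna().sum()               # same figure, computed directly

print(counts)
print("questions with no topic:", missing)
```

With `dropna=False` the counts sum to the total number of rows, which makes it easy to check that nothing was silently excluded.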


#count questions per topic (optionally save the counts to csv)
num_topic = kenya_df['question_topic'].value_counts()
print(num_topic)
#num_topic.to_csv('num_topic.csv')
#print(f"Basic CSV saved to: {os.path.abspath('num_topic.csv')}")  #requires: import os

In [14]:
#calculate number of unique users and number of questions by user status:
print("the number of unique users: ", kenya_df['question_user_id'].nunique())
print("the number of unique user status: ", kenya_df['question_user_status'].value_counts())

the number of unique users:  323267
the number of unique user status:  question_user_status
live      1565420
zombie     562632
Name: count, dtype: int64


In [4]:
#subset kenya questions in english with question id, content & topic
eng_kenya_df = kenya_df.loc[(kenya_df['question_language'] == 'eng'),['question_id','question_content', 'question_topic']]
print(eng_kenya_df.info())
print(eng_kenya_df.head(20))

<class 'pandas.core.frame.DataFrame'>
Index: 1502530 entries, 3 to 2128051
Data columns (total 3 columns):
 #   Column            Non-Null Count    Dtype 
---  ------            --------------    ----- 
 0   question_id       1502530 non-null  int64 
 1   question_content  1502530 non-null  object
 2   question_topic    1168269 non-null  object
dtypes: int64(1), object(2)
memory usage: 45.9+ MB
None
    question_id                                   question_content  \
3       3849129                         Q#.Which plant has omega3?   
6       3849196  what are the effects of animal waste on potato...   
7       3849285  Q.How much is price of 1kg of onions farmgate ...   
9       3849295                        Q.What is the iron of hens?   
18      3849564  Q,What ìs the  best sesion  for planting pasio...   
22      3849837  Q my hens are laying small sized eggs. Custome...   
23      3849862                     q#wts price of potato n kisumu   
24      3849941  Which medicine would

In [5]:
#get topic stats
print("the count by topic: ", eng_kenya_df['question_topic'].value_counts())
print("the number of questions with no topic: ",eng_kenya_df['question_topic'].isna().sum())

the count by topic:  question_topic
cattle          176042
chicken         175502
maize           109027
plant            65294
tomato           64716
                 ...  
leucaena            12
blackberry          10
rye                 10
purple-vetch         7
cranberry            1
Name: count, Length: 148, dtype: int64
the number of questions with no topic:  334261


In [7]:
#drop duplicate questions
eng_kenya_unique_df = eng_kenya_df.drop_duplicates(subset=['question_content'], keep = 'first')
print(eng_kenya_unique_df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 1417596 entries, 3 to 2128051
Data columns (total 3 columns):
 #   Column            Non-Null Count    Dtype 
---  ------            --------------    ----- 
 0   question_id       1417596 non-null  int64 
 1   question_content  1417596 non-null  object
 2   question_topic    1104800 non-null  object
dtypes: int64(1), object(2)
memory usage: 43.3+ MB
None


In [8]:
print(eng_kenya_unique_df.head(20))

    question_id                                   question_content  \
3       3849129                         Q#.Which plant has omega3?   
6       3849196  what are the effects of animal waste on potato...   
7       3849285  Q.How much is price of 1kg of onions farmgate ...   
9       3849295                        Q.What is the iron of hens?   
18      3849564  Q,What ìs the  best sesion  for planting pasio...   
22      3849837  Q my hens are laying small sized eggs. Custome...   
23      3849862                     q#wts price of potato n kisumu   
24      3849941  Which medicine would i apply on spinach whose ...   
28      3850312  Q  Is good to feed dairy cattle with stale paw...   
29      3850314  Q I have a capital of 30,000 to invest in loca...   
31      3850404               Q  how  do  we  milk  a   sick  cow.   
32      3850447              Q when  is  d0cking d0ne  in  a pig .   
33      3850467  Q what is the best treatment for mucus appeara...   
34      3850480  Q h

In [9]:
print(eng_kenya_unique_df.tail(20))

         question_id                                   question_content  \
2128024     59254473  Which species of sukumawiki is best for planti...   
2128025     59254479        Q Daudi asks: How Much Is 1kg Of Chia Seed?   
2128026     59254491  Which  medicine  did  you  give  a cow,who  is...   
2128027     59254515                    Q, The Best Dewormers In Cattle   
2128030     59254543  S what exactlly causes dumping off in cabbage ...   
2128031     59254574           Q  Can Rabbits Survive On Pellets Alone?   
2128035     59254647  S.plot for sale within tongaren costituency ma...   
2128036     59254650                        How to keep poultry farming   
2128037     59254651  S.a plot for sale within tongaren costituency ...   
2128039     59254679  Q#if i buy small kienyeji chicks how long will...   
2128040     59254683  Q#if i buy small kienyeji chicks how long will...   
2128042     59254728  Q Where would I get certified coffee seedlings...   
2128043     59254778     

In [12]:
#slice 1st 2 characters to find most common prefixes (reported as proportions)
prefix_series = eng_kenya_unique_df['question_content'].str.slice(0, 2)
print("the count by prefix: ", prefix_series.value_counts(normalize=True))

the count by prefix:  question_content
Q     3.843479e-01
Wh    9.098431e-02
Q.    8.304834e-02
Q:    4.662753e-02
Q,    4.490137e-02
          ...     
d6    7.054196e-07
fn    7.054196e-07
5r    7.054196e-07
Gz    7.054196e-07
H#    7.054196e-07
Name: proportion, Length: 2025, dtype: float64
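
The values printed above are proportions rather than counts, because the first positional parameter of `value_counts` is `normalize`. A minimal sketch on toy strings (the sample questions are made up) shows the same prefix analysis with the keyword spelled out and the output limited to the most common prefixes:

```python
import pandas as pd

# Made-up questions that mimic the prefixes seen in the data
questions = pd.Series(["Q.What is it", "Q.How much", "Which plant", "Q How do we"])

prefixes = questions.str.slice(0, 2)               # first two characters
shares = prefixes.value_counts(normalize=True)     # proportions, sorted descending
top_shares = shares.head(5)                        # keep only the top entries

print(top_shares)
```

Using `.head(5)` after `value_counts(normalize=True)` is one way to get a short "top five" list, if that was the intent behind the original argument.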


In [14]:
#convert question content to lower case:
eng_kenya_unique_df.loc[:,'question_content'] = eng_kenya_unique_df.loc[:,'question_content'].str.lower()
#eng_kenya_unique_df.head(5)
#remove 'q' from start of questions (pattern is anchored so a 'q' inside words such as 'quail' is kept)
eng_kenya_unique_df.loc[:,'question_adj'] = eng_kenya_unique_df.loc[:,'question_content'].str.replace(r'^q\b', '', regex=True)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  eng_kenya_unique_df.loc[:,'question_adj'] = eng_kenya_unique_df.loc[:,'question_content'].str.replace('q', '', regex=True)
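
This warning typically appears when a new column is assigned to a DataFrame that pandas regards as derived from another one. One common way to avoid it is to take an explicit copy when the subset is created. The sketch below uses a small stand-in frame; the names `source` and `subset` are hypothetical:

```python
import pandas as pd

# Stand-in for the full question table
source = pd.DataFrame({
    "question_language": ["eng", "swa", "eng"],
    "question_content": ["Q.A", "S b", "q c"],
})

# .copy() makes the subset an independent DataFrame, so later
# assignments are unambiguous and do not trigger the warning
subset = source.loc[source["question_language"] == "eng", ["question_content"]].copy()
subset["question_content"] = subset["question_content"].str.lower()

print(subset)
```

The original table is left untouched, which also makes it safe to re-run the cleaning cells in any order.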


In [15]:
eng_kenya_unique_df.head()

Unnamed: 0,question_id,question_content,question_topic,question_adj
3,3849129,q#.which plant has omega3?,plant,#.which plant has omega3?
6,3849196,what are the effects of animal waste on potato...,animal,what are the effects of animal waste on potato...
7,3849285,q.how much is price of 1kg of onions farmgate ...,onion,.how much is price of 1kg of onions farmgate p...
9,3849295,q.what is the iron of hens?,chicken,.what is the iron of hens?
18,3849564,"q,what ìs the best sesion for planting pasio...",passion-fruit,",what ìs the best sesion for planting pasion..."
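
The rows above only show questions where `q` occurs once, at the start. A bare `'q'` pattern would match every `q` in a string, whereas a pattern anchored with `^` touches only the leading marker. The comparison below is a sketch on made-up sentences; the `\b` word boundary is an assumption added so that words that merely begin with "q" are left alone:

```python
import pandas as pd

samples = pd.Series([
    "q.how do i feed quails",
    "what is the best quality maize seed",
])

# Unanchored: removes every 'q', damaging ordinary words
unanchored = samples.str.replace("q", "", regex=True)

# Anchored: removes only a standalone 'q' at the very start
anchored = samples.str.replace(r"^q\b", "", regex=True)

print(unanchored.tolist())
print(anchored.tolist())
```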


In [16]:
#remove numbers (raw string avoids an invalid-escape warning for '\d'):
eng_kenya_unique_df.loc[:,'question_adj'] = eng_kenya_unique_df.loc[:,'question_adj'].str.replace(r'\d+', '', regex=True)

In [17]:
eng_kenya_unique_df.head()

Unnamed: 0,question_id,question_content,question_topic,question_adj
3,3849129,q#.which plant has omega3?,plant,#.which plant has omega?
6,3849196,what are the effects of animal waste on potato...,animal,what are the effects of animal waste on potato...
7,3849285,q.how much is price of 1kg of onions farmgate ...,onion,.how much is price of kg of onions farmgate pr...
9,3849295,q.what is the iron of hens?,chicken,.what is the iron of hens?
18,3849564,"q,what ìs the best sesion for planting pasio...",passion-fruit,",what ìs the best sesion for planting pasion..."


In [18]:
#replace punctuation marks with spaces:
eng_kenya_unique_df.loc[:,'question_clean'] = eng_kenya_unique_df.loc[:,'question_adj'].str.replace(r'[^\w\s]', ' ', regex=True)
eng_kenya_unique_df.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  eng_kenya_unique_df.loc[:,'question_clean'] = eng_kenya_unique_df.loc[:,'question_adj'].str.replace(r'[^\w\s]', ' ', regex=True)


Unnamed: 0,question_id,question_content,question_topic,question_adj,question_clean
3,3849129,q#.which plant has omega3?,plant,#.which plant has omega?,which plant has omega
6,3849196,what are the effects of animal waste on potato...,animal,what are the effects of animal waste on potato...,what are the effects of animal waste on potato...
7,3849285,q.how much is price of 1kg of onions farmgate ...,onion,.how much is price of kg of onions farmgate pr...,how much is price of kg of onions farmgate pr...
9,3849295,q.what is the iron of hens?,chicken,.what is the iron of hens?,what is the iron of hens
18,3849564,"q,what ìs the best sesion for planting pasio...",passion-fruit,",what ìs the best sesion for planting pasion...",what ìs the best sesion for planting pasion...


In [19]:
eng_kenya_unique_df.tail(5)

Unnamed: 0,question_id,question_content,question_topic,question_adj,question_clean
2128046,59256156,how much can i get in onion 1acre,onion,how much can i get in onion acre,how much can i get in onion acre
2128047,59256225,q which crop should we plant in this very litt...,plantain,which crop should we plant in this very littl...,which crop should we plant in this very littl...
2128049,59259045,i want to grow cabbage someone to give me the ...,cabbage,i want to grow cabbage someone to give me the ...,i want to grow cabbage someone to give me the ...
2128050,59260982,q how can i permanently control birds destroyi...,maize,how can i permanently control birds destroyin...,how can i permanently control birds destroyin...
2128051,59261512,q. which is the best season of dlanting tomato,tomato,. which is the best season of dlanting tomato,which is the best season of dlanting tomato
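
Replacing punctuation with spaces can leave leading, trailing, or doubled spaces in `question_clean`, so two questions that differ only in spacing would not be treated as duplicates. This step is not part of the notebook; as an optional sketch on illustrative strings, whitespace could be normalized like this:

```python
import pandas as pd

# Illustrative cleaned strings with stray whitespace
cleaned = pd.Series([
    " which plant has omega ",
    "which  plant has omega",
    " what is the iron of hens ",
])

# Collapse runs of whitespace to one space, then trim both ends
normalized = cleaned.str.replace(r"\s+", " ", regex=True).str.strip()

print(normalized.drop_duplicates().tolist())
```

Whether this belongs here or in the downstream tokenization step is a design choice; most tokenizers ignore extra whitespace, but duplicate detection does not.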


In [20]:
#drop duplicates again (cleaning can make distinct raw questions identical) and save into new dataframe
eng_kenya_unique_df_2 = eng_kenya_unique_df.drop_duplicates(subset=['question_clean'], keep = 'first')
#keep only the id, topic and cleaned-text columns
eng_kenya_q_clean = eng_kenya_unique_df_2[['question_id','question_topic','question_clean']]
print(eng_kenya_q_clean.info())

<class 'pandas.core.frame.DataFrame'>
Index: 1384774 entries, 3 to 2128051
Data columns (total 3 columns):
 #   Column          Non-Null Count    Dtype 
---  ------          --------------    ----- 
 0   question_id     1384774 non-null  int64 
 1   question_topic  1083079 non-null  object
 2   question_clean  1384774 non-null  object
dtypes: int64(1), object(2)
memory usage: 42.3+ MB
None


In [25]:
#write to csv file
eng_kenya_q_clean.to_csv('kenya_eng_q_clean.csv', index=True)
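
Because the file is written with `index=True`, the original row labels are stored as an unnamed first column. A downstream notebook would probably want to read that column back as the index. The sketch below round-trips a two-row stand-in frame through an in-memory buffer instead of a real file; the data values are illustrative:

```python
import io
import pandas as pd

# Two-row stand-in for the cleaned question table, keeping original row labels
clean = pd.DataFrame(
    {
        "question_id": [3849129, 3849295],
        "question_topic": ["plant", None],
        "question_clean": ["which plant has omega", "what is the iron of hens"],
    },
    index=[3, 9],
)

buffer = io.StringIO()
clean.to_csv(buffer, index=True)   # same call shape as in the notebook
buffer.seek(0)

# index_col=0 restores the saved row labels instead of creating 'Unnamed: 0'
restored = pd.read_csv(buffer, index_col=0)
print(restored)
```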