### Overview:
This notebook reads a dataset of questions from Kenyan farmers in Swahili, and processes them into a new dataset ready for tokenization, translation, and further NLP

### Input (in working directory): 
* CSV file of questions by 'live' and 'zombie' users -  'questions_kenya_df.csv' 
* Note:  *question_preprocess.ipynb* notebook creates the input file

### Output (to working directory): 
* CSV file of 'cleaned' questions -  'kenya_swa_q_clean.csv'
* Note:  this is input file for the *nlp_swa.ipynb* notebook

### Steps:
* print summary info
* select questions in swahili
* convert question to lowercase
* remove prefixes to questions: e.g. 'Q', 'S'
* remove "'" from "ng'"
* replace punctuation marks with ' '
* remove numbers
* drop duplicate questions


#import packages:  pandas, numpy, fastparquet (for saving lg data files)
import pandas as pd
import numpy as np
import fastparquet as fp

import nltk


In [3]:
#load processed country question file
kenya_df = pd.read_csv('questions_kenya_df.csv')
kenya_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2128052 entries, 0 to 2128051
Data columns (total 9 columns):
 #   Column                      Dtype 
---  ------                      ----- 
 0   question_id                 int64 
 1   question_user_id            int64 
 2   question_language           object
 3   question_content            object
 4   question_topic              object
 5   question_user_status        object
 6   question_user_country_code  object
 7   question_sent_date          object
 8   response_count              int64 
dtypes: int64(3), object(6)
memory usage: 146.1+ MB


In [6]:
#see initial records
print(kenya_df.head())

    question_id  question_user_id question_language  \
4       3849082            417525               swa   
6       3849096            417525               swa   
9       3849117            524698               swa   
10      3849129             54426               eng   
11      3849143            403213               swa   

                                     question_content question_topic  \
4                         S dawa ya viroboto.kwa kuku        poultry   
6                         S dawa.ya.viroboto.kwa.kuku        poultry   
9   Q:niko Na Punda,,anakohoa Ni Dawa Gany Naexa M...           None   
10                         Q#.Which plant has omega3?          plant   
11  S Ng'ombe aina kani itoayo maziwa 20 lita kwa ...         cattle   

   question_user_status question_user_country_code        question_sent_date  \
4                  live                         ke 2017-11-22 12:25:10+00:00   
6                  live                         ke 2017-11-22 12:25:12+00:00

In [4]:
#calculate # of unique values for language and topic
print("the count by language: ", kenya_df['question_language'].value_counts())
print("the count by topic: ", kenya_df['question_topic'].value_counts())
print("the number of questions with no topic: ",kenya_df['question_topic'].isna().sum())
print("the number of unique user ids: ", kenya_df['question_user_id'].nunique())

the count by language:  question_language
eng    1502530
swa     625522
Name: count, dtype: int64
the count by topic:  question_topic
cattle          255944
chicken         233918
maize           165513
tomato           90407
poultry          88464
                 ...  
leucaena            12
blackberry          11
rye                 10
purple-vetch         7
cranberry            1
Name: count, Length: 148, dtype: int64
the number of questions with no topic:  489890
the number of unique user ids:  323267


num_topic = new_df['question_topic'].value_counts()
print(num_topic)
#df.to_csv('num_topic.csv')
#print(f"Basic CSV saved to: {os.path.abspath('num_topic.csv')}")

In [14]:
#calculate number of unique users and their status:
print("the number of unique users: ", kenya_df['question_user_id'].nunique())
#print("the number of unique user status: ", kenya_df['question_user_status'].value_counts())

the number of unique users:  323267
the number of unique user status:  question_user_status
live      1565420
zombie     562632
Name: count, dtype: int64


In [5]:
#subset kenya questions in swahili with question id & topic
swa_kenya_df = kenya_df.loc[(kenya_df['question_language'] == 'swa'),['question_id','question_content', 'question_topic']]
print(swa_kenya_df.info())
print(swa_kenya_df.head(20))

<class 'pandas.core.frame.DataFrame'>
Index: 625522 entries, 0 to 2128048
Data columns (total 3 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   question_id       625522 non-null  int64 
 1   question_content  625522 non-null  object
 2   question_topic    469893 non-null  object
dtypes: int64(1), object(2)
memory usage: 19.1+ MB
None
    question_id                                   question_content  \
0       3849082                        S dawa ya viroboto.kwa kuku   
1       3849096                        S dawa.ya.viroboto.kwa.kuku   
2       3849117  Q:niko Na Punda,,anakohoa Ni Dawa Gany Naexa M...   
4       3849143  S Ng'ombe aina kani itoayo maziwa 20 lita kwa ...   
5       3849195  S niko na watu kumi hapa busia kwa sasa wanaul...   
8       3849286  S      DAWA  YA  VIFARANGA   YA  KWANZIA   SIK...   
10      3849303  S Nikipanda mahindi pila mbolea, halafu ikimea...   
11      3849313  S napier ndie ifanya ngomb

In [6]:
#get topic stats
print("the count by topic: ", swa_kenya_df['question_topic'].value_counts())
print("the number of questions with no topic: ",swa_kenya_df['question_topic'].isna().sum())

the count by topic:  question_topic
cattle        79902
chicken       58416
maize         56486
poultry       54566
tomato        25691
              ...  
nightshade        1
asparagus         1
blackberry        1
corriander        1
jackfruit         1
Name: count, Length: 136, dtype: int64
the number of questions with no topic:  155629


In [14]:
#drop duplicate questions
swa_kenya_unique_df = swa_kenya_df.drop_duplicates(subset=['question_content'], keep = 'first')
print(swa_kenya_unique_df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 596750 entries, 0 to 2128048
Data columns (total 3 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   question_id       596750 non-null  int64 
 1   question_content  596750 non-null  object
 2   question_topic    447974 non-null  object
dtypes: int64(1), object(2)
memory usage: 18.2+ MB
None


In [15]:
print(swa_kenya_unique_df.head(20))

    question_id                                   question_content  \
0       3849082                        S dawa ya viroboto.kwa kuku   
1       3849096                        S dawa.ya.viroboto.kwa.kuku   
2       3849117  Q:niko Na Punda,,anakohoa Ni Dawa Gany Naexa M...   
4       3849143  S Ng'ombe aina kani itoayo maziwa 20 lita kwa ...   
5       3849195  S niko na watu kumi hapa busia kwa sasa wanaul...   
8       3849286  S      DAWA  YA  VIFARANGA   YA  KWANZIA   SIK...   
10      3849303  S Nikipanda mahindi pila mbolea, halafu ikimea...   
11      3849313  S napier ndie ifanya ngombe itoe maziwa mingi ...   
12      3849328  S. DAWA YA VIFARANGA YA KWANZIA SIKU MOJA NI G...   
13      3849376  s kupe wa ngombe wanaweza tibiwa na dawa ipi i...   
14      3849381  S je mahindi yakishamea inafaa nimwage hiyo mi...   
15      3849454  S Dawa ya kuaa konokono ni gani wamekua wengi ...   
16      3849462  S je eka moja na nusu inafaa nipandie D A P ki...   
17      3849507    S

In [22]:
print(swa_kenya_unique_df.tail(20))

5865713    S,nawaxalim nyte hamjambo?Ak nmejaribu kuulixa...
5865715    S,ak mbona leo hamreply maxwali ak ama kuna nini?
5865716      S,ak xaidia mimi ama leo mmekaxhirika aje hivo?
5865719                   Q?Kuku ikizaliwa unaipea dawa gani
5865720              Q nidawa gani ndio bora kuwauwa kunguni
5865735    S aki nadai kupanda mahindi na sinakitu ya kup...
5865741                              s mko wapi wana wefarm,
5865743                Kuna mkulima anauza kuku katika bomet
5865744    Kuna mkulima anauza kuku katika bomet.naweza k...
5865745    s je nauliza mbegu ipi naweza panda msimu huu ...
5865746          Q.nini inafanya flowe ya maharakwe iankuge.
5865749                        Q.nitabataje points ya wefarm
5865751    Ni dawa gani nawesa tumia kama mbolea kwa sham...
5865783    kuna wadudu ambao wamevamia mahindi yangu dawa...
5865789    Q,dawa gani ukitumia kwa majani jai ni kama mb...
5865799              S Je n mbolea ipi poa ya kutopdres miwa
5865800    s je ni njia 

In [16]:
#slice most common prefixes (1st 3 characters)
prefix_series = swa_kenya_unique_df['question_content'].str.slice(0, 2)
print("the count by prefix: ", prefix_series.value_counts(10))

the count by prefix:  question_content
S      0.296722
Q      0.160481
s      0.054096
S.     0.041627
Ni     0.035665
         ...   
,w     0.000002
wy     0.000002
lS     0.000002
W\n    0.000002
OA     0.000002
Name: proportion, Length: 1634, dtype: float64


In [17]:
#convert question content to lower case:
swa_kenya_unique_df.loc[:,'question_content'] = swa_kenya_unique_df.loc[:,'question_content'].str.lower()
#swa_kenya_unique_df.head(5)
#remove 's.', 's ', 'q' from start of questions
swa_kenya_unique_df.loc[:,'question_adj'] = swa_kenya_unique_df.loc[:,'question_content'].str.replace(r'^(s |q|s.)', '', regex=True)   


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  swa_kenya_unique_df.loc[:,'question_adj'] = swa_kenya_unique_df.loc[:,'question_content'].str.replace(r'^(s |q|s.)', '', regex=True)


In [18]:
swa_kenya_unique_df.head()

Unnamed: 0,question_id,question_content,question_topic,question_adj
0,3849082,s dawa ya viroboto.kwa kuku,poultry,dawa ya viroboto.kwa kuku
1,3849096,s dawa.ya.viroboto.kwa.kuku,poultry,dawa.ya.viroboto.kwa.kuku
2,3849117,"q:niko na punda,,anakohoa ni dawa gany naexa m...",,":niko na punda,,anakohoa ni dawa gany naexa mp..."
4,3849143,s ng'ombe aina kani itoayo maziwa 20 lita kwa ...,cattle,ng'ombe aina kani itoayo maziwa 20 lita kwa siku?
5,3849195,s niko na watu kumi hapa busia kwa sasa wanaul...,,niko na watu kumi hapa busia kwa sasa wanauliz...


In [19]:
#replace "ng'" with "ng":
swa_kenya_unique_df.loc[:,'question_adj'] = swa_kenya_unique_df.loc[:,'question_adj'].str.replace(r"ng\'", 'ng', regex=True)
swa_kenya_unique_df.head()

Unnamed: 0,question_id,question_content,question_topic,question_adj
0,3849082,s dawa ya viroboto.kwa kuku,poultry,dawa ya viroboto.kwa kuku
1,3849096,s dawa.ya.viroboto.kwa.kuku,poultry,dawa.ya.viroboto.kwa.kuku
2,3849117,"q:niko na punda,,anakohoa ni dawa gany naexa m...",,":niko na punda,,anakohoa ni dawa gany naexa mp..."
4,3849143,s ng'ombe aina kani itoayo maziwa 20 lita kwa ...,cattle,ngombe aina kani itoayo maziwa 20 lita kwa siku?
5,3849195,s niko na watu kumi hapa busia kwa sasa wanaul...,,niko na watu kumi hapa busia kwa sasa wanauliz...


In [25]:
#remove numbers:
swa_kenya_unique_df.loc[:,'question_adj'] = swa_kenya_unique_df.loc[:,'question_adj'].str.replace('\d+', '', regex=True)

In [26]:
swa_kenya_unique_df.head()

Unnamed: 0,question_id,question_content,question_topic,question_adj,question_clean
0,3849082,s dawa ya viroboto.kwa kuku,poultry,dawa ya viroboto.kwa kuku,dawa ya viroboto kwa kuku
1,3849096,s dawa.ya.viroboto.kwa.kuku,poultry,dawa.ya.viroboto.kwa.kuku,dawa ya viroboto kwa kuku
2,3849117,"q:niko na punda,,anakohoa ni dawa gany naexa m...",,":niko na punda,,anakohoa ni dawa gany naexa mp...",niko na punda anakohoa ni dawa gany naexa mp...
4,3849143,s ng'ombe aina kani itoayo maziwa 20 lita kwa ...,cattle,ngombe aina kani itoayo maziwa lita kwa siku?,ngombe aina kani itoayo maziwa 20 lita kwa siku
5,3849195,s niko na watu kumi hapa busia kwa sasa wanaul...,,niko na watu kumi hapa busia kwa sasa wanauliz...,niko na watu kumi hapa busia kwa sasa wanauliz...


In [27]:
#replace punctuation marks with spaces:
swa_kenya_unique_df.loc[:,'question_clean'] = swa_kenya_unique_df.loc[:,'question_adj'].str.replace(r'[^\w\s]', ' ', regex=True)
swa_kenya_unique_df.head()

Unnamed: 0,question_id,question_content,question_topic,question_adj,question_clean
0,3849082,s dawa ya viroboto.kwa kuku,poultry,dawa ya viroboto.kwa kuku,dawa ya viroboto kwa kuku
1,3849096,s dawa.ya.viroboto.kwa.kuku,poultry,dawa.ya.viroboto.kwa.kuku,dawa ya viroboto kwa kuku
2,3849117,"q:niko na punda,,anakohoa ni dawa gany naexa m...",,":niko na punda,,anakohoa ni dawa gany naexa mp...",niko na punda anakohoa ni dawa gany naexa mp...
4,3849143,s ng'ombe aina kani itoayo maziwa 20 lita kwa ...,cattle,ngombe aina kani itoayo maziwa lita kwa siku?,ngombe aina kani itoayo maziwa lita kwa siku
5,3849195,s niko na watu kumi hapa busia kwa sasa wanaul...,,niko na watu kumi hapa busia kwa sasa wanauliz...,niko na watu kumi hapa busia kwa sasa wanauliz...


In [33]:
swa_kenya_unique_df.tail(5)

Unnamed: 0,question_id,question_content,question_topic,question_adj
5865799,59254606,s je n mbolea ipi poa ya kutopdres miwa,sugar-cane,je n mbolea ipi poa ya kutopdres miwa
5865800,59254618,s je ni njia gan poa ndatumia kupandisha ng'om...,cattle,je ni njia gan poa ndatumia kupandisha ng'ombe ?
5865801,59254625,s mkulima anauliza: dawa ya viwavi jeshi ni gani?,,mkulima anauliza: dawa ya viwavi jeshi ni gani?
5865808,59254715,caren anauliza jinsi ya kupanda viazi na pilip...,chilli,caren anauliza jinsi ya kupanda viazi na pilip...
5865815,59258899,"naomba,mbegu ya mahindi..nmeishiwa kabisa na s...",maize,"naomba,mbegu ya mahindi..nmeishiwa kabisa na s..."


In [28]:
#drop duplicates again and save into new dataframe
swa_kenya_unique_df_2 = swa_kenya_unique_df.drop_duplicates(subset=['question_clean'], keep = 'first')
#print(swa_kenya_unique_df_2.info())
swa_kenya_q_clean = swa_kenya_unique_df_2[['question_id','question_topic','question_clean']]
#rename question_adj column to question_text
print(swa_kenya_q_clean.info())

<class 'pandas.core.frame.DataFrame'>
Index: 590498 entries, 0 to 2128048
Data columns (total 3 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   question_id     590498 non-null  int64 
 1   question_topic  443528 non-null  object
 2   question_clean  590498 non-null  object
dtypes: int64(1), object(2)
memory usage: 18.0+ MB
None


In [29]:
swa_kenya_q_clean.head()

Unnamed: 0,question_id,question_topic,question_clean
0,3849082,poultry,dawa ya viroboto kwa kuku
2,3849117,,niko na punda anakohoa ni dawa gany naexa mp...
4,3849143,cattle,ngombe aina kani itoayo maziwa lita kwa siku
5,3849195,,niko na watu kumi hapa busia kwa sasa wanauliz...
8,3849286,chicken,dawa ya vifaranga ya kwanzia siku ...


In [30]:
#write to csv file
swa_kenya_q_clean.to_csv('kenya_swa_q_clean.csv', index=True)