We share data on 140M+ households from 19 states which were part of the 2011 SECC.
We used secc.py to scrape the data from http://18.104.22.168/netnrega/secc_list.aspx. In all, we have over 420M records from 19 states: arunachal pradesh, assam, bihar, chhattisgarh, gujarat, haryana, kerala, madhya pradesh, maharashtra, mizoram, nagaland, odisha, punjab, rajasthan, sikkim, tamil nadu, uttar pradesh, uttarakhand, west bengal.
The data we downloaded had over 420M rows, which is clearly too many. To clean the data, we first downloaded it from the Dataverse (to which we had uploaded it) and compared it to the aggregate data provided online. We found two broad reasons for the excess rows: empty rows and duplicated rows. Some combinations of dropdowns have no data. In the initial data we downloaded, we kept those combinations but put an empty string in all other fields starting
There are also a lot of duplicates. The duplicates stem from multiple entries for the auto_inclusion_deprivation_or_exclusion_or_other field per row, and from village names that appear in multiple rows with "(GP)" and "(village)". We keep the village duplicates because there are about 3M of them and it is not immediately clear whether they are duplicates or different villages.
We take care of these issues in this notebook and compare the output with SQL results as a robustness check. The following notebook compares the final data to the aggregate numbers posted online. There are still differences, but now we have fewer rows than the aggregates rather than excess rows. Finally, we upload the deduped data to the Dataverse. (The file names have the word 'deduped' in them.)
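The two cleaning steps described above (dropping empty rows, then dropping duplicates) can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code in the notebook. It rests on several assumptions: that state, district, tehsil, and panchayat are the dropdown fields that stay filled in on empty rows; that duplicates are rows identical in every column except auto_inclusion_deprivation_or_exclusion_or_other; and that the sample values are made up. The names `DROPDOWN_COLS`, `RADIO_COL`, and `clean_rows` are hypothetical.

```python
import csv
import io

# Assumption: these are the dropdown-derived columns that are filled in
# even when a dropdown combination returned no data.
DROPDOWN_COLS = {"state", "district", "tehsil", "panchayat"}

# Column whose multiple entries produce duplicated rows (per the README).
RADIO_COL = "auto_inclusion_deprivation_or_exclusion_or_other"


def clean_rows(rows):
    """Yield rows that are neither empty nor duplicates.

    A row is treated as empty when every non-dropdown field is blank.
    Two rows are treated as duplicates when they match on every column
    except RADIO_COL; the first occurrence is kept.
    """
    seen = set()
    for row in rows:
        other_values = [v for k, v in row.items() if k not in DROPDOWN_COLS]
        if all(not (v or "").strip() for v in other_values):
            continue  # dropdown combination with no data
        key = tuple((k, v) for k, v in row.items() if k != RADIO_COL)
        if key in seen:
            continue  # same record listed under another radio button
        seen.add(key)
        yield row


# Made-up sample with a subset of the columns, for illustration only.
SAMPLE = """state,district,tehsil,panchayat,auto_inclusion_deprivation_or_exclusion_or_other,head_of_hh,age
bihar,patna,t1,p1,auto_inclusion_or_deprivation,A,40
bihar,patna,t1,p1,other,A,40
bihar,patna,t1,p2,,,
bihar,patna,t1,p3,exclusion,B,55
"""

cleaned = list(clean_rows(csv.DictReader(io.StringIO(SAMPLE))))
for row in cleaned:
    print(row["panchayat"], row["head_of_hh"])
```

Because the set of seen keys is held in memory, this approach suits a per-state file or a sample rather than all 420M rows at once; a database or chunked processing would be a more realistic choice at full scale.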
The original data has the following columns:
state, district, tehsil, panchayat, language, auto_inclusion_deprivation_or_exclusion_or_other ('auto_inclusion_or_deprivation' if that radio button is clicked, exclusion if that button is clicked, other if that button is clicked), head_of_hh, gender, age, social_cat, fathers_and_mothers_name, deprivation_count, auto_inclusion_deprivation_code, total_members, hh_summary_auto_inclusion, hh_summary_auto_exclusion, hh_summary_auto_other, hh_summary_deprivation
The original and deduped data are posted at: https://doi.org/10.7910/DVN/LIIBNB
Suriyan Laohaprapanon and Gaurav Sood