This repository has been archived by the owner on Jun 9, 2023. It is now read-only.

Socio-Economic Caste Census 2011

🚫 This repository has been archived. The code was written to scrape data at a point in time.

We share data on 140M+ households from 19 states which were part of the 2011 SECC.

Table of Contents:

  1. Scraping
  2. De-Duping and Removing Empty Rows
  3. Data
  4. Applications


Scraping

We scraped the data from the SECC website. In all, we have over 420M records from 19 states: arunachal pradesh, assam, bihar, chhattisgarh, gujarat, haryana, kerala, madhya pradesh, maharashtra, mizoram, nagaland, odisha, punjab, rajasthan, sikkim, tamil nadu, uttar pradesh, uttarakhand, west bengal.

De-Duping and Removing Empty Rows

The data we downloaded had over 420M rows, far more than the 140M+ households the SECC covers. To clean the data, we first download it from the Dataverse (to which we had uploaded it) and compare it to the aggregate data posted online. We found two broad reasons for the excess rows: empty rows and duplicated rows. Some combinations of dropdowns have no data. In the initial download we kept those combinations but stored an empty string in every field from head_of_hh onward.
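A minimal sketch of the empty-row filter in pandas, assuming the empty cells are stored as empty strings; the column names come from the README, but the toy rows and values are invented:

```python
import pandas as pd

# Toy rows mimicking the scraped layout (values are made up for illustration).
df = pd.DataFrame({
    "state": ["bihar", "bihar", "assam"],
    "district": ["patna", "patna", "kamrup"],
    "head_of_hh": ["RAM", "", "SITA"],
    "gender": ["M", "", "F"],
    "age": ["45", "", "39"],
})

# Columns from head_of_hh onward carry the household record; a row whose
# record fields are all empty strings is a dropdown combination with no data.
record_cols = df.columns[df.columns.get_loc("head_of_hh"):]
non_empty = df[~(df[record_cols] == "").all(axis=1)].reset_index(drop=True)
print(len(non_empty))  # 2 of the 3 toy rows survive
```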

There are also a lot of duplicates. They stem from two sources: rows repeated for each entry in the auto_inclusion_deprivation_or_exclusion_or_other field, and village names that appear in two rows, one tagged "(GP)" and one tagged "(village)". We keep the village duplicates, as there are only about 3M of them and it is not immediately clear whether they are true duplicates or different villages.
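The dedup step can be sketched as follows, again with invented toy data. Dropping duplicates on every column except the multi-entry field collapses those repeats (keeping the first entry; the real pipeline may aggregate instead) while the "(GP)"/"(village)" pairs survive because their panchayat values differ:

```python
import pandas as pd

# Toy data: one household repeats because the multi-entry radio-button field
# has several values; another appears under both "(GP)" and "(village)".
df = pd.DataFrame({
    "state": ["bihar"] * 4,
    "panchayat": ["x (GP)", "x (GP)", "y (GP)", "y (village)"],
    "head_of_hh": ["RAM", "RAM", "SITA", "SITA"],
    "auto_inclusion_deprivation_or_exclusion_or_other": [
        "auto_inclusion", "deprivation", "other", "other",
    ],
})

# Dedup on all columns except the multi-entry field.
key_cols = [c for c in df.columns
            if c != "auto_inclusion_deprivation_or_exclusion_or_other"]
deduped = df.drop_duplicates(subset=key_cols).reset_index(drop=True)
print(len(deduped))  # 3: the RAM repeat collapses, the village pair stays
```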

We take care of these issues in this notebook and compare the results with SQL queries as a robustness check. A follow-up notebook compares the final data to the aggregate numbers posted online. Differences remain, but we now fall short of the posted aggregates rather than exceed them. Finally, we upload the deduped data to the Dataverse. (The file names contain the word 'deduped'.)
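The robustness check against the posted aggregates amounts to a per-state count comparison. A hypothetical sketch, with invented numbers standing in for both the deduped data and the posted aggregates:

```python
import pandas as pd

# Invented stand-ins: rows of the deduped data and the household counts
# posted online for each state.
deduped = pd.DataFrame({"state": ["bihar"] * 3 + ["assam"] * 2})
posted = pd.Series({"bihar": 3, "assam": 2})

# Count rows per state and subtract the posted aggregates (aligned by state).
counts = deduped.groupby("state").size()
diff = counts - posted
print((diff == 0).all())  # True: the toy counts match the toy aggregates
```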


The original data has the following columns:

state, district, tehsil, panchayat, language, auto_inclusion_deprivation_or_exclusion_or_other (auto_inclusion_or_deprivation if that radio button is clicked, exclusion if that button is clicked, other if that button is clicked), head_of_hh, gender, age, social_cat, fathers_and_mothers_name, deprivation_count, auto_inclusion_deprivation_code, total_members, hh_summary_auto_inclusion, hh_summary_auto_exclusion, hh_summary_auto_other, hh_summary_deprivation
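For convenience, the column list above can be kept as a Python constant; the file name in the commented read call is hypothetical:

```python
# The 18 columns of the original data, in order (names as listed above).
COLUMNS = [
    "state", "district", "tehsil", "panchayat", "language",
    "auto_inclusion_deprivation_or_exclusion_or_other", "head_of_hh",
    "gender", "age", "social_cat", "fathers_and_mothers_name",
    "deprivation_count", "auto_inclusion_deprivation_code", "total_members",
    "hh_summary_auto_inclusion", "hh_summary_auto_exclusion",
    "hh_summary_auto_other", "hh_summary_deprivation",
]

# If a raw CSV lacks a header row, the list can supply one, e.g.:
# df = pd.read_csv("secc_state.csv", names=COLUMNS)  # hypothetical file name
print(len(COLUMNS))  # 18
```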

The original and deduped data are posted at:


We use the data to develop the Python package outkast, which infers caste from the last name. We also use the data to estimate the percentage of female heads of household by last name.
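The female-HoH estimate reduces to a groupby over last names. A sketch of that logic on invented records (this does not use outkast's API, just plain pandas):

```python
import pandas as pd

# Invented records for illustration.
df = pd.DataFrame({
    "head_of_hh": ["RAM YADAV", "SITA YADAV", "GEETA DEVI"],
    "gender": ["M", "F", "F"],
})

# Take the final token of the name as the last name, then compute the
# share of female heads of household within each last name.
df["last_name"] = df["head_of_hh"].str.split().str[-1]
female_share = (df["gender"] == "F").groupby(df["last_name"]).mean()
print(female_share.to_dict())  # {'DEVI': 1.0, 'YADAV': 0.5}
```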


Suriyan Laohaprapanon and Gaurav Sood