in-rolls/parse_unsearchable_rolls

Parse Unsearchable Electoral Rolls

Some Indian electoral rolls are searchable, with a separate text layer in the right encoding (see here). Most are not. Here, we provide scripts that parse unsearchable rolls from the following states: Bihar, Chandigarh, Delhi (English), Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttar Pradesh, and Uttarakhand.

Scripts and Test Results from Sample of PDFs

We provide one script per state because the roll format varies slightly from state to state. Each Python script takes as input the path to the PDF electoral rolls to be parsed and produces a CSV that generally has the following columns (the precise set varies by state):

number (top left box in the elector field), id, elector_name, father_or_husband_name,
husband (dummy for husband), house_no, age, sex, ac_name, parl_constituency, part_no,
year, state, filename, main_town, police_station, mandal, revenue_division, district,
pin_code, polling_station_name, polling_station_address, net_electors_male,
net_electors_female, net_electors_third_gender, net_electors_total

We run some basic quality checks on the data: data types, missing values, and field sizes. For instance, a data type check confirms that numeric fields contain only numbers, and a field size check looks at the number of characters in a name or in a pin_code.
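As a rough illustration of these checks, here is a minimal sketch in Python that validates a single parsed row. It is not the code used in this repository: the function name `check_row`, the choice of required and numeric fields, and the name-length threshold are all assumptions made for the example. The only outside fact it relies on is that Indian PIN codes have six digits.

```python
import re

# Assumptions for illustration only; the real scripts may check other fields.
REQUIRED_FIELDS = ("id", "elector_name", "age", "sex")
NUMERIC_FIELDS = ("number", "age", "part_no", "year")
MAX_NAME_LEN = 100  # hypothetical upper bound on name length


def check_row(row):
    """Return a list of problems found in one parsed row (a dict of strings)."""
    problems = []

    # Missing values: required fields must be present and non-blank.
    for field in REQUIRED_FIELDS:
        if not (row.get(field) or "").strip():
            problems.append(f"{field}: missing")

    # Data types: numeric fields, when present, must contain only digits.
    for field in NUMERIC_FIELDS:
        value = (row.get(field) or "").strip()
        if value and not value.isdigit():
            problems.append(f"{field}: not numeric")

    # Field size: names should not be implausibly long.
    name = (row.get("elector_name") or "").strip()
    if len(name) > MAX_NAME_LEN:
        problems.append("elector_name: too long")

    # Field size: a pin_code, when present, should be exactly six digits.
    pin = (row.get("pin_code") or "").strip()
    if pin and not re.fullmatch(r"\d{6}", pin):
        problems.append("pin_code: expected 6 digits")

    return problems


if __name__ == "__main__":
    # Made-up row, not taken from any electoral roll.
    sample = {"id": "ABC1234567", "elector_name": "Ram Kumar", "age": "3x",
              "sex": "M", "number": "1", "pin_code": "1100"}
    for problem in check_row(sample):
        print(problem)
```

To check a whole output file, each row read with `csv.DictReader` could be passed to `check_row`, and the rows that return a non-empty list flagged for review.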

  1. Assam
  2. Bihar
  3. Chandigarh
  4. Dadra
  5. Daman
  6. Delhi
  7. Haryana
  8. Himachal Pradesh
  9. Jharkhand
  10. Karnataka
  11. Kerala
  12. Lakshadweep
  13. Madhya Pradesh
  14. Maharashtra
  15. Odisha
  16. Punjab
  17. Rajasthan
  18. Sikkim
  19. Tamil Nadu
  20. Telangana
  21. Tripura
  22. Uttar Pradesh
  23. Uttarakhand
  24. West Bengal

Data

The final data is posted here.

Transliteration

We tried both polyglot and indic_trans. Both have issues, but indic_trans is better. indicate is better still.

Authors

Madhu Sanjeevi and Gaurav Sood