In [1]:
!pip install pandas






### Election Results Data Aggregation and Cleaning Script

This Python script aggregates and cleans election results data from multiple CSV files stored in a hierarchical folder structure. It performs the following tasks:

1. **Imports necessary libraries**: `os`, `pandas`, and `numpy`.
2. **Defines the data folder path**: `data_folder` where the election results are stored.
3. **Initializes an empty DataFrame**: `unified_dataset` to store the combined data.
4. **Reads and concatenates all CSV files**:
   - Iterates through each state folder in the `data_folder`.
   - For each state folder, iterates through each CSV file.
   - Reads each CSV file into a DataFrame.
   - Adds columns for the state and constituency.
   - Concatenates the DataFrame to the `unified_dataset`.
5. **Cleans the dataset**:
   - Drops the column `S.N.`.
   - Removes rows where the `Candidate` column contains the string 'Total'.
   - Replaces every field whose value is exactly '-' with `NaN` (null value); values that merely contain a hyphen are left unchanged.
6. **Saves the cleaned dataset** to a single CSV file named `unified_dataset.csv`.

This script is useful for consolidating and cleaning election results data from various states and constituencies into a single, manageable dataset.
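
The cleaning steps listed above can be tried on a small in-memory table before running them on the real files. The following is a minimal sketch, not the notebook's own code: the rows, the party names, and the file name `Sampleville - results.csv` are made up for illustration, and the `na=False` argument is an added safeguard for missing candidate names.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for one constituency file (all values are invented).
toy = pd.DataFrame({
    "S.N.": ["1", "2", ""],
    "Candidate": ["A. Example", "B. Sample", "Total"],
    "Party": ["Party X", "Party-Y", ""],
    "EVM Votes": ["120", "-", "120"],
    "Total Votes": ["125", "-", "125"],
})

# Hypothetical file name; the constituency is the part before the first " - ".
file_name = "Sampleville - results.csv"
toy["Constituency"] = file_name.split(" - ")[0]

# Drop the serial-number column.
cleaned = toy.drop(columns=["S.N."])

# Remove the summary row; na=False treats a missing name as "not a Total row".
cleaned = cleaned[~cleaned["Candidate"].str.contains("Total", na=False)]

# Replace cells that are exactly "-" with NaN; "Party-Y" is left untouched.
cleaned = cleaned.replace("-", np.nan)

print(cleaned)
```

Note that `replace('-', np.nan)` matches whole cell values only, so a hyphen inside a longer string such as a party name survives the cleaning step.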


In [2]:
import os
import pandas as pd
import numpy as np

data_folder = "Data/Election_Results"
unified_dataset = pd.DataFrame()

# Read and concatenate all CSV files
for state_folder in os.listdir(data_folder):
    print(state_folder)
    state_path = os.path.join(data_folder, state_folder)
    if os.path.isdir(state_path):
        for file_name in os.listdir(state_path):
            file_path = os.path.join(state_path, file_name)
            if os.path.isfile(file_path):
                df = pd.read_csv(file_path)
                df["State"] = state_folder
                # Extract the constituency name: the part of the file name before the first " - " separator
                constituency_name = file_name.split(" - ")[0]
                df["Constituency"] = constituency_name
                unified_dataset = pd.concat([unified_dataset, df], ignore_index=True)

# Clean the dataset
unified_dataset = unified_dataset.drop(columns=['S.N.'])
# na=False treats a missing candidate name as "not a Total row" instead of producing an invalid mask
unified_dataset = unified_dataset[~unified_dataset['Candidate'].str.contains('Total', na=False)]

# Replace every field whose value is exactly '-' with NaN (null value)
unified_dataset = unified_dataset.replace('-', np.nan)

# Save the cleaned dataset to a single CSV file
unified_dataset.to_csv("unified_dataset.csv", index=False)


Andaman & Nicobar Islands
Andhra Pradesh
Arunachal Pradesh
Assam
Bihar
Chandigarh
Chhattisgarh
Dadra & Nagar Haveli and Daman & Diu
Goa
Gujarat
Haryana
Himachal Pradesh
Jammu and Kashmir
Jharkhand
Karnataka
Kerala
Ladakh
Lakshadweep
Madhya Pradesh
Maharashtra
Manipur
Meghalaya
Mizoram
Nagaland
NCT OF Delhi
Odisha
Puducherry
Punjab
Rajasthan
Sikkim
Tamil Nadu
Telangana
Tripura
Uttar Pradesh
Uttarakhand
West Bengal




### Election Winner Extraction Script

This Python script processes the cleaned election results data to identify the winning candidate for each constituency and state. It performs the following tasks:

1. **Groups the dataset**: 
   - Groups the `unified_dataset` by `Constituency` and `State`.
   - Applies a lambda function to each group to find the row with the maximum `Total Votes`. 
   - Uses `fillna(-1)` on `Total Votes` so that a missing total never outranks a real vote count; a group whose only row has a missing total still returns that row.
   - Resets the index of the resulting DataFrame.

2. **Selects relevant columns**:
   - Extracts the columns `Candidate`, `Party`, `Total Votes`, `Postal Votes`, `EVM Votes`, `% of Votes`, `State`, and `Constituency` from the grouped DataFrame to create `winner_dataset`.

3. **Saves the winner dataset**:
   - Writes the `winner_dataset` to a CSV file named `winner.csv` without including the index.

This script is useful for identifying and saving the winning candidates for each constituency and state from the aggregated election results data.
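
The selection logic described above can be checked on a toy table. The sketch below is an alternative formulation, not the notebook's code: it computes `idxmax` directly on the grouped `Total Votes` series instead of calling `groupby(...).apply(...)`, and it assumes the DataFrame index is unique (which `ignore_index=True` in the aggregation step provides). The states, constituencies, and vote counts are invented.

```python
import numpy as np
import pandas as pd

# Toy data: two contested constituencies and one with a missing total.
toy = pd.DataFrame({
    "State": ["S1", "S1", "S1", "S2"],
    "Constituency": ["C1", "C1", "C2", "C3"],
    "Candidate": ["A", "B", "C", "D ( Uncontested )"],
    "Total Votes": [100.0, 250.0, 80.0, np.nan],
})

# Fill missing totals with -1 so they never beat a real count,
# then find the index label of the largest total in each group.
filled = toy["Total Votes"].fillna(-1)
winner_idx = filled.groupby([toy["Constituency"], toy["State"]]).idxmax()

# Look the winning rows up by label; the original NaN is preserved in the result.
winners = toy.loc[winner_idx.to_numpy()].reset_index(drop=True)

print(winners)
```

Because the fill value is only used for ranking, the row for the constituency with a missing total keeps `NaN` in `Total Votes` rather than showing `-1`.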


In [3]:
# For each (Constituency, State) group, keep the row with the highest 'Total Votes'
winner_dataset = (
    unified_dataset
    .groupby(['Constituency', 'State'])
    .apply(lambda x: x.loc[x['Total Votes'].fillna(-1).idxmax()])
    .reset_index(drop=True)
)
winner_dataset = winner_dataset[['Candidate', 'Party', 'Total Votes', 'Postal Votes', 'EVM Votes', '% of Votes', 'State', 'Constituency']]
winner_dataset.to_csv("winner.csv", index=False)



### Identifying Uncontested Candidates Script

This Python script processes the cleaned election results data to identify constituencies where the `Total Votes` are missing (NaN), which can indicate uncontested candidates. It performs the following tasks:

1. **Identifies missing `Total Votes`**:
   - Filters the `unified_dataset` to find rows where the `Total Votes` column is NaN.
   - Stores the resulting DataFrame in `nan_total_votes`.

2. **Prints the DataFrame**:
   - Prints the `nan_total_votes` DataFrame to display the constituencies and candidates with missing `Total Votes`.

This script is useful for identifying constituencies with uncontested candidates, as missing `Total Votes` can indicate that no votes were cast due to the lack of opposition.

```python
# Identify rows with missing 'Total Votes'
nan_total_votes = unified_dataset[unified_dataset['Total Votes'].isna()]

# Print the DataFrame with missing 'Total Votes'
print(nan_total_votes)
```


In [4]:
nan_total_votes = unified_dataset[unified_dataset['Total Votes'].isna()]
print(nan_total_votes)

                                           Candidate                   Party  \
1857  MUKESHKUMAR CHANDRAKAANT DALAL ( Uncontested )  Bharatiya Janata Party   

      EVM Votes Postal Votes  Total Votes  % of Votes    State Constituency  
1857        NaN          NaN          NaN         NaN  Gujarat        Surat  
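
The row printed above also carries an `( Uncontested )` marker in the candidate name. As a cross-check, the two signals can be compared. This is a minimal sketch on invented rows; it assumes the marker text is spelled consistently in the source data, which has not been verified here.

```python
import numpy as np
import pandas as pd

# Toy rows: one ordinary candidate and one marked as uncontested.
toy = pd.DataFrame({
    "Candidate": ["A. Example", "B. Sample ( Uncontested )"],
    "Total Votes": [1200.0, np.nan],
})

# Signal 1: the name contains the marker (case-insensitive, NaN-safe).
has_marker = toy["Candidate"].str.contains("Uncontested", case=False, na=False)

# Signal 2: the vote total is missing.
missing_total = toy["Total Votes"].isna()

# Rows where the two signals disagree deserve a manual look.
disagree = toy[has_marker != missing_total]

print(disagree)
```

An empty `disagree` frame suggests that every missing total is explained by an uncontested seat and vice versa.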


### Verification Script for winner.csv

In [5]:
import os
import pandas as pd

data_folder = "Data/Election_Results"
winner_dataset = pd.read_csv("winner.csv")

for state_folder in os.listdir(data_folder):
    state_path = os.path.join(data_folder, state_folder)
    if os.path.isdir(state_path):
        if state_folder not in winner_dataset['State'].unique():
            print(f"State {state_folder} is not present in the winner.csv")
        
        for file_name in os.listdir(state_path):
            file_path = os.path.join(state_path, file_name)
            if os.path.isfile(file_path):
                constituency_name = file_name.split(" - ")[0]
                if constituency_name not in winner_dataset[winner_dataset['State'] == state_folder]['Constituency'].unique():
                    print(f"Constituency {constituency_name} in state {state_folder} is not present in the winner.csv")
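
A complementary check is that the winner table holds exactly one row per state and constituency pair. The sketch below uses an invented table rather than `winner.csv`; the check is keyed on the pair because the same constituency name could, in principle, appear in more than one state.

```python
import pandas as pd

# Toy winner table (invented values); "C1" appears in two different states.
winners = pd.DataFrame({
    "State": ["S1", "S1", "S2"],
    "Constituency": ["C1", "C2", "C1"],
    "Candidate": ["A", "B", "C"],
})

# keep=False marks every member of a duplicated (State, Constituency) pair.
dupes = winners[winners.duplicated(subset=["State", "Constituency"], keep=False)]

print(dupes)
```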


### Verification Script for unified_dataset.csv


In [6]:
import os
import pandas as pd

data_folder = "Data/Election_Results"
unified_dataset = pd.read_csv("unified_dataset.csv")

for state_folder in os.listdir(data_folder):
    state_path = os.path.join(data_folder, state_folder)
    if os.path.isdir(state_path):
        if state_folder not in unified_dataset['State'].unique():
            print(f"State {state_folder} is not present in the unified_dataset.csv")
        
        for file_name in os.listdir(state_path):
            file_path = os.path.join(state_path, file_name)
            if os.path.isfile(file_path):
                constituency_name = file_name.split(" - ")[0]
                if constituency_name not in unified_dataset[unified_dataset['State'] == state_folder]['Constituency'].unique():
                    print(f"Constituency {constituency_name} in state {state_folder} is not present in the unified_dataset.csv")
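
Both verification cells walk the same folder layout, so the comparison can also be expressed as a set difference over (state, constituency) pairs. The following is a sketch under assumptions: the helper names `expected_pairs` and `missing_pairs` are hypothetical, and the demo runs against a throwaway temporary directory with invented file names rather than `Data/Election_Results`.

```python
import os
import tempfile

import pandas as pd


def expected_pairs(data_folder):
    """Collect the (state, constituency) pairs implied by the folder layout."""
    pairs = set()
    for state in os.listdir(data_folder):
        state_path = os.path.join(data_folder, state)
        if not os.path.isdir(state_path):
            continue
        for file_name in os.listdir(state_path):
            if os.path.isfile(os.path.join(state_path, file_name)):
                # Same naming rule as the notebook: text before the first " - ".
                pairs.add((state, file_name.split(" - ")[0]))
    return pairs


def missing_pairs(data_folder, dataset):
    """Return the pairs present on disk but absent from the dataset."""
    present = set(zip(dataset["State"], dataset["Constituency"]))
    return sorted(expected_pairs(data_folder) - present)


# Demo on a temporary folder with one state and two empty constituency files.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "S1"))
    for name in ("C1 - results.csv", "C2 - results.csv"):
        open(os.path.join(root, "S1", name), "w").close()

    dataset = pd.DataFrame({"State": ["S1"], "Constituency": ["C1"]})
    demo_missing = missing_pairs(root, dataset)

print(demo_missing)
```

Building the set of present pairs once avoids re-filtering the DataFrame for every file, and the same helper can be pointed at either the unified dataset or the winner table.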