# Data Cleaning Challenge - Optional cleaning (Taken from nb5)

## INTRO (Text taken from the `More Practice!` section of the nb5)

Do any other columns in this dataframe have inconsistent data entry? If you can find any, try to tidy them up.

You can also try reading in the `PakistanSuicideAttacks Ver 6 (10-October-2017).csv` file from this dataset and tidying up any inconsistent columns in that data file.

## Setting up Environment

First, we will load the necessary libraries required for this analysis. The datasets will be introduced later in the process.

In [1]:
# modules we'll use
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# helpful modules
import thefuzz
from thefuzz import process
import chardet

# Handling directories
import os
import kaggle_cleaning
from kaggle_cleaning.config import RAW_DATA_DIR, CLEAN_DATA_DIR

# set seed for reproducibility
np.random.seed(0)

## Read CSV file

When attempting to read the `'PakistanSuicideAttacks Ver 6 (10-October-2017).csv'` file for the first time, a character encoding error occurred. To resolve this, I will first determine the correct encoding before proceeding.

In [2]:
psa_file_v6 = os.path.join(RAW_DATA_DIR, 'PakistanSuicideAttacks Ver 6 (10-October-2017).csv')
with open(psa_file_v6, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(20000))
result['encoding']

'Windows-1252'

Using the 'Windows-1252' encoding, I will now read the CSV file and store it as a DataFrame.

In [3]:
df = pd.read_csv(psa_file_v6, encoding = result['encoding'])

## Initial Data Examination  

To assess the structure and quality of the dataset, an initial examination is carried out. This helps identify missing values, data types, and potential inconsistencies in the records. Additionally, a statistical summary provides insight into the distribution of numerical variables, allowing for the detection of anomalies that may require further attention.  



In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 492 entries, 0 to 491
Data columns (total 26 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   S#                       492 non-null    int64  
 1   Date                     492 non-null    object 
 2   Islamic Date             336 non-null    object 
 3   Blast Day Type           481 non-null    object 
 4   Holiday Type             72 non-null     object 
 5   Time                     281 non-null    object 
 6   City                     492 non-null    object 
 7   Latitude                 490 non-null    float64
 8   Longitude                490 non-null    object 
 9   Province                 492 non-null    object 
 10  Location                 489 non-null    object 
 11  Location Category        457 non-null    object 
 12  Location Sensitivity     456 non-null    object 
 13  Open/Closed Space        457 non-null    object 
 14  Influencing Event/Event  1

In [5]:
df.describe()

Unnamed: 0,S#,Latitude,Killed Min,Killed Max,Injured Min,No. of Suicide Blasts,Temperature(C),Temperature(F)
count,492.0,490.0,346.0,476.0,361.0,410.0,487.0,485.0
mean,246.5,32.621874,14.855491,15.30042,31.601108,1.117073,21.075554,69.906373
std,142.172431,2.477845,17.659154,20.32867,38.762832,0.396749,8.390859,15.108953
min,1.0,24.879503,0.0,0.0,0.0,1.0,-2.37,27.734
25%,123.75,31.8238,3.0,3.0,7.0,1.0,14.6175,58.253
50%,246.5,33.5833,8.0,8.0,20.0,1.0,21.295,70.331
75%,369.25,34.0043,20.75,19.0,40.0,1.0,28.115,82.499
max,492.0,35.3833,125.0,148.0,320.0,4.0,44.0,111.0


In [6]:
df.describe(include= ['category','object'])

Unnamed: 0,Date,Islamic Date,Blast Day Type,Holiday Type,Time,City,Longitude,Province,Location,Location Category,Location Sensitivity,Open/Closed Space,Influencing Event/Event,Target Type,Targeted Sect if any,Injured Max,Explosive Weight (max),Hospital Names
count,492,336,481,72,281,492,490.0,492,489,457,456,457,187,466,97,460,169,294
unique,451,301,3,15,216,93,97.0,9,483,24,4,6,169,24,7,91,112,236
top,Thursday-July 19-2007,5 Rajab 1428 A.H,Working Day,Weekend,Evening,Peshawar,71.5448,KPK,Imambargah,Police,High,Open,Sectarian strife,Military,Shiite,2,6kg,Lady Reading Hospital
freq,3,3,398,45,12,71,71.0,250,3,90,264,317,7,115,38,20,8,20


### Observations from the Preliminary Analysis of the Dataset

Following an initial review of the dataset, the following adjustments and refinements are recommended:

- The `'Date'` column should be converted to `datetime` format.
- The `'Longitude'` column should be converted to `float` type.
- There are two rows where both `'Latitude'` and `'Longitude'` contain null values. If a city and/or province is recorded, the geographical location can be researched and imputed accordingly.
- The `'Injured Max'` column should be converted to a numeric type.
- The following columns should be converted to `integer` type: `'Killed min'`, `'Killed max'`, `'Injured min'`, `'Injured max'`, and `'Suicide Blasts'`.
- The `'Explosive Weight (max)'` column may require further analysis to ensure accuracy.
- The suitability of converting temperatures from Celsius to Fahrenheit (`Temperature C to F`) should be assessed.
- The `"Open/Closed Space"` column contains six unique values, whereas the expected number was only two or possibly three. This discrepancy suggests inconsistencies that require further investigation and standardisation.
- Lists such as cities and provinces should be reviewed for consistency and correctness.

## Standardising Date Formats and Correcting Misspellings

This cell addresses inconsistencies in the `'Date'` column by performing the following steps:

- Corrects a known misspelling in the dataset (`'Thursay' → 'Thursday'`).
- Defines a list of expected date formats to ensure proper conversion.
- Implements a function to iteratively attempt conversion using the predefined formats.
- Applies the function to standardise all date entries.
- Identifies any remaining entries that could not be converted, allowing for further investigation.


In [7]:
# Fix incorrect mispelled date value
df.loc[452, 'Date'] = 'Thursday-Aug 27-2015'

# List of possible date formats
date_formats = [
    '%A-%B %d-%Y',
    '%A-%B-%d-%Y',
    '%A-%b-%d-%Y',    
    '%A-%b %d-%Y'     
]

# Function to convert dates by trying multiple formats
def parse_dates(date):
    for fmt in date_formats:
        try:
            return pd.to_datetime(date, format=fmt)
        except ValueError:
            continue
    return pd.NaT  # Return NaT if no format matches

# Convert and replace the 'Date' column
df['Date'] = df['Date'].apply(parse_dates)

# Check for any remaining NaT values
df[df['Date'].isna()]


Unnamed: 0,S#,Date,Islamic Date,Blast Day Type,Holiday Type,Time,City,Latitude,Longitude,Province,...,Targeted Sect if any,Killed Min,Killed Max,Injured Min,Injured Max,No. of Suicide Blasts,Explosive Weight (max),Hospital Names,Temperature(C),Temperature(F)


## Converting Longitude to Float and Handling Missing Values

This cell ensures the `'Longitude'` column is correctly formatted as a numeric type and addresses missing values in the dataset:

- Converts the `'Longitude'` column to `float` for consistency in numerical operations.
- Identifies and displays rows where `'Longitude'` values are missing.
- Since both missing entries have corresponding `'City'` and `'Province'` values, their geographic coordinates are retrieved from **reliable sources**.
- The missing latitude and longitude values are manually updated using coordinates from Wikipedia (last checked on **9th March 2025**):
  - **Lahore, Punjab** → (`Latitude: 31.549722, Longitude: 74.343611`)
  - **Quetta, Balochistan** → (`Latitude: 30.183333, Longitude: 67.000000`)
- The corrected entries are displayed for verification.

In [8]:
# Lets change the type of the Longitude columnt to float
df['Longitude'] = df['Longitude'].astype(float)

# Then Lets check the positions where there are two empty values
display(df[df['Longitude'].isna()])

# As there is a City and Province for the two empty values lets get it from Internet (both last checked on 2025-03-09)
# https://en.wikipedia.org/wiki/Lahore 
df.loc[449, ['Latitude','Longitude']]=[31.549722, 74.343611] 
# https://en.wikipedia.org/wiki/Quetta 
df.loc[450, ['Latitude','Longitude']]=[30.183333, 67.0] 
display(df.loc[[449,450],:])

Unnamed: 0,S#,Date,Islamic Date,Blast Day Type,Holiday Type,Time,City,Latitude,Longitude,Province,...,Targeted Sect if any,Killed Min,Killed Max,Injured Min,Injured Max,No. of Suicide Blasts,Explosive Weight (max),Hospital Names,Temperature(C),Temperature(F)
449,450,2015-06-29,,Working Day,,,Lahore,,,Punjab,...,,1.0,4.0,,http://www.samaa.tv/pakistan/29-Jun-2015/shoot...,,,,,
450,451,2015-07-17,,Holiday,,,Quetta,,,Balochistan,...,,1.0,1.0,,,,,,,


Unnamed: 0,S#,Date,Islamic Date,Blast Day Type,Holiday Type,Time,City,Latitude,Longitude,Province,...,Targeted Sect if any,Killed Min,Killed Max,Injured Min,Injured Max,No. of Suicide Blasts,Explosive Weight (max),Hospital Names,Temperature(C),Temperature(F)
449,450,2015-06-29,,Working Day,,,Lahore,31.549722,74.343611,Punjab,...,,1.0,4.0,,http://www.samaa.tv/pakistan/29-Jun-2015/shoot...,,,,,
450,451,2015-07-17,,Holiday,,,Quetta,30.183333,67.0,Balochistan,...,,1.0,1.0,,,,,,,


## Correcting  `'Injured Max'` column

This cell processes the `'Injured Max'` column to maintain numerical consistency:

- Converts `'Injured Max'` to numeric values, handling non-numeric entries as `NaN`.
- Fills missing values in `'Injured Max'` with the column’s median to preserve distribution integrity.
- Ensures that `'Injured Max'` is at least equal to `'Injured Min'` by applying a row-wise maximum.

This guarantees that the reported maximum injuries are never lower than the minimum injuries.

In [9]:
# Convert 'Injured Max' and 'Injured Min' to numeric, converting non-numeric values to NaN
df['Injured Max'] = pd.to_numeric(df['Injured Max'], errors='coerce')

# Fill missing values in 'Injured Max' with the median
df['Injured Max'] = df['Injured Max'].fillna(df['Injured Max'].median())

# Ensure 'Injured Max' is at least equal to 'Injured Min'
df['Injured Max'] = df[['Injured Max', 'Injured Min']].max(axis=1)


## Cleaning and Standardising the 'Open/Closed Space' Column

This cell ensures consistency in the `'Open/Closed Space'` column by performing the following steps:

- **Retrieves unique values** to identify inconsistencies in formatting.
- **Standardises text formatting** by:
  - Removing leading and trailing spaces.
  - Converting all values to lowercase to ensure uniformity.
- **Replaces ambiguous entries**:
  - The category `'open/closed'` is converted to `NaN` to reflect missing or unclear data.
- **Displays the updated value counts**, including `NaN` values, to verify the cleaning process.


In [10]:
df["Open/Closed Space"].unique()

array(['Closed', 'Open', 'open', 'Open ', nan, 'closed', 'Open/Closed'],
      dtype=object)

In [11]:
df["Open/Closed Space"] = df["Open/Closed Space"].str.strip().str.lower()
df["Open/Closed Space"] = df["Open/Closed Space"].replace({"open/closed":np.nan})
df["Open/Closed Space"].value_counts(dropna=False)

Open/Closed Space
open      329
closed    127
NaN        36
Name: count, dtype: int64

## Checking and Filling Missing Temperature Data

This step ensures consistency in temperature values by identifying and addressing missing data:

- **Identifies rows where either `Temperature(C)` or `Temperature(F)` is missing** and displays them for review.
- **Uses the conversion function `C2F()`** to fill missing Fahrenheit values where the corresponding Celsius value is available.
- **Updates specific missing Fahrenheit values at index `475` and `490`**, ensuring accuracy while preserving the original dataset.
- **Leaves rows where both temperature values are missing untouched**, as further work can be done to estimate these values using average temperatures for the corresponding zone.

This approach ensures that temperature data remains complete where possible while highlighting areas that require additional imputation.

In [12]:
def C2F(celsius):
    return (celsius * 9/5) + 32
def F2C(fahrenheit):
    return (fahrenheit - 32) * 5/9

In [13]:
print(df[(df['Temperature(C)'].isna())| (df['Temperature(F)'].isna())][['Temperature(C)','Temperature(F)']])

     Temperature(C)  Temperature(F)
165             NaN             NaN
449             NaN             NaN
450             NaN             NaN
473             NaN             NaN
475            19.0             NaN
479             NaN             NaN
490            29.0             NaN


In [14]:
df.loc[475, 'Temperature(F)'] = C2F(df.loc[475, 'Temperature(C)'])
df.loc[490, 'Temperature(F)'] = C2F(df.loc[490, 'Temperature(C)'])

## Cleaning Cities and Provinces

#### Standardising City Names for Consistency

This cell ensures that city names in the dataset are formatted uniformly by applying the following transformations:

- **Converts all city names to lowercase** to maintain consistency and avoid case-sensitive mismatches.
- **Removes any leading or trailing whitespace** to eliminate unintentional variations.
- **Extracts all unique city names** from the dataset to identify inconsistencies.
- **Sorts the unique city names alphabetically** to facilitate easier review and verification.

These steps help standardise the data, making it easier to detect duplicates, match locations accurately, and perform further cleaning if necessary.


In [15]:
# convert to lower case
df['City'] = df['City'].str.lower()
# remove trailing white spaces
df['City'] = df['City'].str.strip()

# get all the unique values in the 'province' column
cities = df['City'].unique()
# sort them alphabetically and then take a closer look
cities.sort()
cities

array(['attock', 'bajaur agency', 'bannu', 'bhakkar', 'buner', 'chakwal',
       'chaman', 'charsadda', 'd. i khan', 'd.g khan', 'd.i khan',
       'dara adam khel', 'fateh jang', 'ghallanai, mohmand agency',
       'gujrat', 'hangu', 'haripur', 'hayatabad', 'islamabad',
       'jacobabad', 'karachi', 'karak', 'khanewal', 'khuzdar',
       'khyber agency', 'kohat', 'kuram agency', 'kurram agency',
       'lahore', 'lakki marwat', 'lasbela', 'lower dir', 'malakand',
       'mansehra', 'mardan', 'mohmand agency',
       'mosal kor, mohmand agency', 'multan', 'muzaffarabad',
       'north waziristan', 'nowshehra', 'orakzai agency', 'peshawar',
       'pishin', 'poonch', 'quetta', 'rawalpindi', 'sargodha',
       'sehwan town', 'shabqadar-charsadda', 'shangla', 'shikarpur',
       'sialkot', 'south waziristan', 'sudhanoti', 'sukkur', 'swabi',
       'swat', 'taftan', 'tangi, charsadda district', 'tank', 'taunsa',
       'tirah valley', 'totalai', 'upper dir', 'wagah', 'zhob'],
      dtype=

This cell ensures that city names are corrected and standardised by using fuzzy string matching to identify and replace similar variations. The process follows these steps:

- **Defines a set of standard city names** (`d.i khan`, `kurram agency`, `mohmand agency`, `charsadda`) to be used as reference.
- **Uses fuzzy matching (`thefuzz.process.extract`)** to find the closest matches for each standard name within the dataset.
- **Filters matches with a similarity score of 90 or higher**, ensuring only highly similar names are considered.
- **Replaces all identified variations with the corresponding standard name**, ensuring consistency in city names.

By applying this approach, variations in spelling, formatting, or abbreviations are resolved, making the dataset more uniform and easier to analyse.


In [16]:
standard_names = ["d.i khan", "kurram agency", "mohmand agency", "charsadda"]

for name in standard_names:
    # Find the closest matches with a score of 90 or higher
    matches = process.extract(name, cities, limit=5)
    close_matches = [match[0] for match in matches if match[1] >= 90]
    
    print(f"{name}: {close_matches}")  # Display detected similar names
    
    # Replace all close matches with the standard name
    df.loc[df['City'].isin(close_matches), 'City'] = name


d.i khan: ['d.i khan', 'd. i khan']
kurram agency: ['kurram agency', 'kuram agency']
mohmand agency: ['mohmand agency', 'ghallanai, mohmand agency', 'mosal kor, mohmand agency']
charsadda: ['charsadda', 'shabqadar-charsadda', 'tangi, charsadda district']


#### Standardising Province Names

A similar procedure is applied to ensure consistency in province names within the dataset. This involves:

- **Converting all names to lowercase** and **removing trailing spaces** to eliminate formatting inconsistencies.
- **Identifying and replacing variations** (e.g., alternate spellings) to align with a standard set of province names.
- **Ensuring uniformity** across the dataset, making it more reliable for analysis.

This process helps prevent mismatches and inconsistencies that could affect grouping, filtering, and interpretation of the data.


In [17]:
# convert to lower case
df['Province'] = df['Province'].str.lower()
# remove trailing white spaces
df['Province'] = df['Province'].str.strip()

# get all the unique values in the 'province' column
provinces = df['Province'].unique()
# sort them alphabetically and then take a closer look
provinces.sort()
provinces

array(['ajk', 'balochistan', 'baluchistan', 'capital', 'fata', 'kpk',
       'punjab', 'sindh'], dtype=object)

In [18]:
df["Province"] = df["Province"].replace({"baluchistan": "balochistan", 'capital':'ict', 'fata':'kpk'})

In [19]:
print(df['Province'].unique())

['ict' 'sindh' 'balochistan' 'punjab' 'kpk' 'ajk']


In [20]:
print(df['City'].unique())

['islamabad' 'karachi' 'quetta' 'rawalpindi' 'north waziristan' 'kohat'
 'attock' 'sialkot' 'lahore' 'swat' 'hangu' 'bannu' 'lasbela' 'malakand'
 'peshawar' 'd.i khan' 'lakki marwat' 'tank' 'gujrat' 'charsadda'
 'kurram agency' 'shangla' 'bajaur agency' 'south waziristan' 'haripur'
 'sargodha' 'nowshehra' 'mohmand agency' 'dara adam khel' 'khyber agency'
 'mardan' 'bhakkar' 'orakzai agency' 'buner' 'd.g khan' 'pishin' 'chakwal'
 'upper dir' 'muzaffarabad' 'totalai' 'multan' 'lower dir' 'sudhanoti'
 'poonch' 'mansehra' 'karak' 'swabi' 'shikarpur' 'sukkur' 'chaman'
 'khanewal' 'fateh jang' 'taftan' 'tirah valley' 'wagah' 'zhob' 'taunsa'
 'jacobabad' 'khuzdar' 'hayatabad' 'sehwan town']


In [21]:
pakistan_csv_file=os.path.join(CLEAN_DATA_DIR, 'pakistan_clean.csv')
df.to_csv(pakistan_csv_file)

## Examining Categorical Data  

To ensure consistency and detect potential issues, each column containing textual or categorical data will be reviewed. This step helps identify irregularities such as inconsistent formatting, unexpected values, or missing entries that may require standardisation or correction.  

### `'Blast Day Type'`: Validating and Correcting 

To ensure consistency in the dataset, the `"Blast Day Type"` column is reviewed and updated where necessary. If the value is missing, it is assigned based on the corresponding date. Rows marked as `"Holiday"` remain unchanged, while any incorrect classifications are corrected to either `"Working Day"` or `"Weekend"` in alignment with the actual day of the week.  

In [22]:
df['Blast Day Type'].unique()

array(['Holiday', 'Working Day', nan, 'Weekend'], dtype=object)

In [23]:
# Define a function to determine the correct Blast Day Type
def correct_blast_day_type(row):
    if pd.isna(row['Blast Day Type']):  # If 'Blast Day Type' is NaN, assign based on the date
        return 'Weekend' if row['Date'].weekday() >= 5 else 'Working Day'
    elif row['Blast Day Type'] == 'Holiday':  # If 'Blast Day Type' is 'Holiday', keep it unchanged
        return 'Holiday'
    else:  # Otherwise, validate and correct it
        correct_type = 'Weekend' if row['Date'].weekday() >= 5 else 'Working Day'
        return correct_type if row['Blast Day Type'] != correct_type else row['Blast Day Type']

# Apply the function to correct the 'Blast Day Type' column
df['Blast Day Type'] = df.apply(correct_blast_day_type, axis=1)


### `'Holiday Types'`: Standardising

In order to maintain consistency across the dataset, a few changes are made to the `'Holiday Type'` column:
- All instances of `'Weekend'` are reclassified as `'General Holiday'`.
- The values `'Ashura'` and `'Ashura Holiday'` are merged into a single category, `'Ashura'`.
- Variations of `'Christmas/ birthday of Quaid-e-Azam'` are standardised to a consistent spelling, `'Christmas/birthday of Quaid-e-Azam'`.

These adjustments ensure that the holiday types are represented uniformly, making the data cleaner and more accurate for analysis.


In [24]:
df['Holiday Type'].unique()

array(['Weekend', nan, 'Christmas/birthday of Quaid-e-Azam', 'Ashura',
       'Eid Milad un-Nabi', 'Iqbal Day', 'Eid-ul-azha', 'Labour Day',
       'Eid-ul-Fitar', 'Pakistan Day', 'Defence Day',
       'Christmas/ birthday of Quaid-e-Azam', 'General Elections',
       'Eid Holidays', 'Ashura Holiday', 'Eid ul Azha Holiday'],
      dtype=object)

In [25]:
# Replace 'Weekend' with 'General Holiday'
df['Holiday Type'] = df['Holiday Type'].replace('Weekend', 'General Holiday')
# Merge 'Ashura' and 'Ashura Holiday'
df['Holiday Type'] = df['Holiday Type'].replace({'Ashura Holiday': 'Ashura'})
# Merge 'Christmas/birthday of Quaid-e-Azam' values
df['Holiday Type'] = df['Holiday Type'].replace({
    'Christmas/ birthday of Quaid-e-Azam': 'Christmas/birthday of Quaid-e-Azam'
})


### `'Location'`: Review

In [26]:
len(df['Location'].unique())

484

The `'Location'` column contains **484 unique values**. Given the high number of unique locations, it may not be practical to correct all of them manually. In such cases, it is important to assess whether standardisation or consolidation of values would significantly impact the analysis. If the column is not central to the project's objectives or if correcting the entries would introduce more complexity than it’s worth, we might choose to leave the column as it is, while still monitoring for any obvious inconsistencies or issues.

### `'Location Category'`: Standardising

To ensure consistency across the dataset, several changes are made to the `'Location Category'` column:
- The values `'foreign'` and `'Foreign'` are standardised to `'Foreigner'`.
- `'Government Official'` and `'Government/Office Building'` are merged into the broader category `'Government'`.
- `'Residential Building'` and `'Residence'` are grouped under the category `'Residential'`.
- `'Commercial/residence'` is standardised to `'Commercial'`.
- The value `'Mobile'` is merged with `'Transport'`.
- Blank spaces and the string `'nan'` are converted to `np.nan` to handle missing or invalid entries.

These changes ensure that location categories are consistent, making the dataset more uniform and ready for analysis.

In [27]:
df['Location Category'].unique()

array(['Foreign', 'Office Building', 'Hotel', 'Religious', 'Mobile',
       'Military', 'Residence', 'Park/Ground', 'Market', 'Police',
       'Airport', 'Government', 'Transport', 'Hospital', 'Educational',
       'Civilian', 'Bank', 'Foreigner', 'Government/Office Building',
       'Government Official', 'foreign', nan, 'Residential Building',
       'Commercial/residence', ' '], dtype=object)

In [28]:
# Standardise values in 'Location Category' column
df['Location Category'] = df['Location Category'].replace({
    'foreign': 'Foreigner',  # Standardising case inconsistency
    'Foreign': 'Foreigner',  # Standardising 'Foreign' to 'Foreigner'
    'Government Official': 'Government',  # Grouping 'Government Official' with 'Government'
    'Government/Office Building': 'Government',  # Grouping 'Government/Office Building' with 'Government'
    'Residential Building': 'Residence',  # Grouping 'Residential Building' with 'Residence'
    'Residence': 'Residential',  # Standardising 'Residence' to 'Residential'
    'Commercial/residence': 'Commercial',  # Grouping 'Commercial/residence' with 'Commercial'
    'Mobile': 'Transport',  # Merging 'Mobile' with 'Transport'
    ' ': np.nan,  # Converting blank spaces to NaN
    'nan': np.nan  # Converting 'nan' (string) to NaN
})

df['Location Category'].value_counts(dropna=False)

Location Category
Police             90
Transport          75
Military           69
Religious          57
Market             40
NaN                36
Park/Ground        32
Residential        25
Government         21
Hotel              10
Office Building     9
Foreigner           8
Educational         6
Hospital            5
Bank                4
Commercial          2
Airport             1
Civilian            1
Residence           1
Name: count, dtype: int64

### `'Location Sensitivity'`: Standardising 

To ensure consistency in the `'Location Sensitivity'` column, the following adjustments are made:
- All values are converted to lowercase to eliminate any case inconsistencies.
- Leading and trailing spaces are removed from the entries to ensure uniformity.

These steps help standardise the column, making the data cleaner and more consistent for analysis.

In [29]:
df['Location Sensitivity'].value_counts(dropna=False)

Location Sensitivity
High      264
Low       115
Medium     74
NaN        36
low         3
Name: count, dtype: int64

In [30]:
# Convert all values in 'Location Sensitivity' to lowercase to standardise the case
df['Location Sensitivity'] = df['Location Sensitivity'].str.lower()
# Remove any leading or trailing spaces from the 'Location Sensitivity' values
df['Location Sensitivity'] = df['Location Sensitivity'].str.strip()

### `'Influencing Event/Event'`: Review

In [31]:
len(df['Influencing Event/Event'].unique())

170

The `'Influencing Event/Event'` column contains **170 unique values**. Given the high number of distinct events, it is decided that this column will not be changed. The variety in values is important for the analysis, and modifying them may lead to the loss of valuable distinctions.

### `'Target Type'`: Standardising 

The following values in the `'Target Type'` column are merged to standardise the data:
- 'civilian' is merged with 'Civilian'.
- 'Government official' is standardised to 'Government Official'.
- 'Army' is merged with 'Military'.
- Variations of 'religious' are standardised to 'Religious', including 'Shia sect'.
- 'foreigner' is merged with 'Foreigner'.
- Variations of 'police', including 'Police & Rangers' and 'Rangers', are consolidated under 'Police'.
- 'advocates (lawyers)' and 'Civilian Judges' are merged into 'Judges & lawyers'.
- 'Children/Women' is merged into 'Civilian'.

These changes help ensure consistency in the data, making it easier to analyse and interpret.

In [32]:
df['Target Type'].unique()

array(['Foreigner', 'Media', 'Religious', 'Military',
       'Government Official', 'civilian', 'Civilian', 'Police',
       'Government official', 'police', 'Children/Women',
       'Anti-Militants', 'foreigner', 'Unknown', 'religious', nan,
       'Rangers', 'Police & Rangers', 'Civilian & Police', 'Army',
       'Frontier Corps ', 'advocates (lawyers)', 'Civilian Judges',
       'Shia sect', 'Judges & lawyers'], dtype=object)

In [33]:
# Standardise and merge values in the 'Target Type' column
df['Target Type'] = df['Target Type'].replace({
    'civilian': 'Civilian',  # Merging case inconsistency for 'civilian'
    'Government official': 'Government Official',  # Merging case inconsistency
    'Army': 'Military',  # Merging 'Army' with 'Military'
    'religious': 'Religious',  # Merging case inconsistency
    'foreigner': 'Foreigner',  # Merging case inconsistency
    'police': 'Police',  # Merging case inconsistency for 'police'
    'Police & Rangers': 'Police',  # Merging 'Police & Rangers' with 'Police'
    'Rangers': 'Police',  # Merging 'Rangers' with 'Police'
    'advocates (lawyers)': 'Judges & lawyers',  # Merging with 'Judges & lawyers'
    'Civilian Judges': 'Judges & lawyers',  # Merging with 'Judges & lawyers'
    'Children/Women': 'Civilian',  # Merging 'Children/Women' with 'Civilian'
    'Shia sect': 'Religious',  # Merging 'Shia sect' with 'Religious'
    'Unknown': np.nan
})


### `'Targeted Sect if any'`: Standardising

The following values in the `'Targeted Sect if any'` column are merged to standardise the data:
- 'shiite' is standardised to 'Shiite' to correct case inconsistency.
- 'Shiite/sunni' is merged into 'Shiite/Sunni' to represent the mixed reference to both sects.

These changes help ensure consistency in the data, making it cleaner and more interpretable for analysis.




In [34]:
df['Targeted Sect if any'].value_counts()

Targeted Sect if any
Shiite          38
Sunni           38
Christian        9
shiite           9
Shiite/sunni     1
Jews             1
Ahmedi           1
Name: count, dtype: int64

In [35]:
# Standardise and merge values in the 'Religion' or 'Sect' column
df['Targeted Sect if any'] = df['Targeted Sect if any'].replace({
    'shiite': 'Shiite',  # Merging case inconsistency for 'Shiite'
    'Shiite/sunni': 'Shiite/Sunni',  # Merging 'Shiite/sunni' to 'Mixed' or consider another approach
})


### `'Hospital Names'`: Reviewing 

The `'Hospital Names'` column contains **237 unique values**, and while no changes will be made to this column at this stage, it could prove useful to standardise or clean it if required for future analysis. In some cases, multiple hospital names appear in a single entry, which could lead to further data cleaning or merging in the future, depending on the analysis needs. For now, the column will remain unchanged, but it may be revisited later if the need arises.

In [36]:
len(df['Hospital Names'].unique())

237

## Converting Columns to Categorical Data Types

To optimise the dataset for analysis, several columns have been converted to categorical data types. This helps improve memory usage and performance when filtering, grouping, or aggregating data. The following columns have been categorised:

- `'Blast Day Type'` with three categories: `Working Day`, `Weekend`, and `Holiday`.
- `'Holiday Type'`, which contains multiple holiday types, enabling efficient grouping.
- `'City'` with 61 unique cities.
- `'Province'` with 6 unique provinces.
- `'Location Category'` with 17 unique categories.
- `'Location Sensitivity'` has been converted to an ordered categorical type with three levels: `low < medium < high`.
- `'Target Type'` with 11 unique target types.
- `'Targeted Sect if any'` with 6 unique sects.
- `'Open/Closed Space'` with two possible values: `open` and `closed`.

These conversions ensure the data is well-structured for analysis and will lead to more efficient processing when working with the dataset.


In [37]:
# Convert columns to categorical data types
df['Blast Day Type'] = df['Blast Day Type'].astype('category')  # 3 categories (Working Day, Weekend, Holiday)
df['Holiday Type'] = df['Holiday Type'].astype('category')  # Multiple holiday types, useful for grouping
df['City'] = df['City'].astype('category')  # 61 unique cities, converting to categorical
df['Province'] = df['Province'].astype('category')  # 6 unique provinces
df['Location Category'] = df['Location Category'].astype('category')  # 17 unique categories
df['Location Sensitivity'] = pd.Categorical(df['Location Sensitivity'], categories=['low', 'medium', 'high'], ordered=True)  # Ordered categories (low < medium < high)
df['Target Type'] = df['Target Type'].astype('category')  # 11 unique target types
df['Targeted Sect if any'] = df['Targeted Sect if any'].astype('category')  # 6 unique sects
df['Open/Closed Space'] = df['Open/Closed Space'].astype('category')  # 2 values (open, closed)


In [39]:
display(df.describe(include= ['category','object']))

Unnamed: 0,Islamic Date,Blast Day Type,Holiday Type,Time,City,Province,Location,Location Category,Location Sensitivity,Open/Closed Space,Influencing Event/Event,Target Type,Targeted Sect if any,Explosive Weight (max),Hospital Names
count,336,492,72,281,492,492,489,456,456,456,187,465,97,169,294
unique,301,3,13,216,61,6,483,18,3,2,169,11,6,112,236
top,5 Rajab 1428 A.H,Working Day,General Holiday,Evening,peshawar,kpk,Imambargah,Police,high,open,Sectarian strife,Military,Shiite,6kg,Lady Reading Hospital
freq,3,350,45,12,74,323,3,90,264,329,7,117,47,8,20


In [40]:
df['Explosive Weight (max)'].unique()

array([nan, '2.5 Kg', '30kg in each car', '2kg', '10-15 kg', '5 kg',
       '10 kg ', '5 to 6 Kg', '10 to 15 kg', 'more than 5Kg', '5 to 6 kg',
       '15 kg', '3 kg explosive and 3kg ball bearings', '6-7kg', '4-5kg',
       '50kg', '30kg ', '100Kg', '150Kg', '1000 Kg', ' 8 Kg', '60Kg',
       '8 ft deep crater', '40 kg', '7 or 8Kg', ' 10 kg ', '8Kg', '100kg',
       '100', '85', '125 kg', '5kg', '160', '500 Kg', '24Kg', '10Kg',
       '20 Kg', '12Kg', '12 Kg', '150 kg ', '180Kg', '100 Kg', '70 kg ',
       '16 Kg', '15Kg', '10 Kg', '7Kg ', '400Kg', '200Kg', '50-60Kg',
       '250Kg', '1000 KG', '250 kg', '600 kg', '40Kg', '5Kg', ' 12 kg',
       '600Kg', '12 kg', '14 kg', '8 kg', '300 kg', '20 kg ', '10 kg',
       '20Kg', ' 6 kg', '1200Kg', '12 kg ', '1000 Kg ', '400 KG',
       '1000 kg', '7Kg', '9Kg', '16Kg', '250kg', '8kg', '180kg',
       '16-20kg', '300kg', '450kg', '7kg', '60kg', '500 kg', '5-6kg',
       '3kg', '6kg', '6Kg', '100-110kg', '30kg', '7kg-8kg', '100kg-150kg',
     

In [61]:
df['Explosive Weight (max)'].isna().sum()

np.int64(355)

In [59]:
# Replace variations of 'kg' (e.g., 'kg', 'Kg', 'KG') with an empty string ''
df['Explosive Weight (max)'] = df['Explosive Weight (max)'].str.strip()
df['Explosive Weight (max)'] = df['Explosive Weight (max)'].str.replace(r'[kK][gG]$', '', regex=True)
df['Explosive Weight (max)'] = df['Explosive Weight (max)'].str.strip()

replacement_dict = {
    '30kg in each car': 30,
    '10-15': 12.5,
    '7-8': 7.5,
    '100-110': 105,
    '5 to 6': 5.5,
    '10 to 15': 12.5,
    '5 to 6 Kg': 5.5,
    '20-25': 22.5,
    '25-30': 27.5,
    '4-5': 4.5,
    '50-60': 55,
    '6-7': 6.5,
    '3 kg explosive and 3kg ball bearings': 6,  # Assuming the weight is 6kg
    '7kg-8': 7.5,
    '100kg-150': 125,
    '600kg-800': 700,
    '4kg-5': 4.5,
    '1000kg': 1000,
    '7-8kg': 7.5,
    '4kg-6': 5,  # Average of 4-6
    '16-20kg': 18,  # Average of 16-20
    '600': 600,
    '12kg': 12,
    '8 ft deep crater': np.nan,  # Replace with NaN for this specific case
    'more than 5':5,
    '7 or 8':7.8,
    '5-6':5.5,
    '16-20':18
}

df['Explosive Weight (max)'] = df['Explosive Weight (max)'].replace(replacement_dict)
df['Explosive Weight (max)'] = pd.to_numeric(df['Explosive Weight (max)'], errors='coerce')
df['Explosive Weight (max)'].unique()

array([   nan,    2.5,    2. ,    5. ,   10. ,   15. ,   50. ,   30. ,
        100. ,  150. , 1000. ,    8. ,   60. ,   40. ,   85. ,  125. ,
        160. ,  500. ,   24. ,   20. ,   12. ,  180. ,   70. ,   16. ,
          7. ,  400. ,  200. ,  250. ,   14. ,  300. ,    6. , 1200. ,
          9. ,  450. ,    3. ,   45. ,   13. ,   75. ,   25. ])

In [64]:
df['Explosive Weight (max)'].value_counts(sort=False)

Explosive Weight (max)
2.5        1
2.0        2
5.0        7
10.0      16
15.0       9
50.0       3
30.0       2
100.0      7
150.0      2
1000.0     6
8.0       14
60.0       4
40.0       2
85.0       1
125.0      1
160.0      1
500.0      2
24.0       1
20.0       5
12.0       9
180.0      2
70.0       1
16.0       2
7.0        3
400.0      3
200.0      1
250.0      5
14.0       2
300.0      4
6.0       11
1200.0     1
9.0        1
450.0      1
3.0        1
45.0       1
13.0       1
75.0       1
25.0       1
Name: count, dtype: int64