## Preamble
_______

In [4]:
!pip install -r ../requirements.txt 

Defaulting to user installation because normal site-packages is not writeable


In [5]:
import pandas as pd
import numpy as np
import re


from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Suppressing warnings keeps the presentation clean, but they should still be analyzed and fixed.
# I didn't have time for that.
import warnings
warnings.filterwarnings('ignore')

import chardet

np.random.seed(0)

%run functions.ipynb

## Data
________


**TODO**

* Download dataset from [here](https://www.kaggle.com/zusmani/pakistansuicideattacks/download).
* Identify the encoding of the data in `filename`
* Read the CSV into the `suicide_attacks` variable using the correct encoding (the `chardet` module might come in handy).
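
The helper `get_most_likely_encoding` used in the next cell is defined in `functions.ipynb` and is not shown here. As a minimal sketch of how such a helper could be written with `chardet`, consider the following; the name `guess_encoding` and the `sample_size` parameter are hypothetical and not the notebook's actual implementation.

```python
import chardet


def guess_encoding(path, sample_size=100_000):
    """Guess a file's text encoding from its first `sample_size` bytes.

    Hypothetical helper: returns (encoding_name, confidence).
    """
    # Read raw bytes; decoding is exactly what we do not know how to do yet.
    with open(path, "rb") as raw_file:
        sample = raw_file.read(sample_size)
    # chardet.detect returns a dict with 'encoding' and 'confidence' keys.
    result = chardet.detect(sample)
    return result["encoding"], result["confidence"]
```

The detected name could then be passed on as `pd.read_csv(filename, encoding=encoding)`. Detection is statistical, so a low confidence value is a hint to inspect the file by hand.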

In [6]:
BASE_PATH = '/home/matias/data/PakistanSuicideAttacks/'

list_suffix = [
    'PakistanSuicideAttacks Ver 6 (10-October-2017).csv',
    'PakistanSuicideAttacks Ver 11 (30-November-2017).csv',
]

# BASE_PATH already ends with '/', so no extra separator is needed.
filename = f'{BASE_PATH}{list_suffix[0]}'
encoding = get_most_likely_encoding(filename)
encoding

'Windows-1252'

In [7]:
suicide_attacks = build_df_from_appending_csv_files(
    list_suffix=list_suffix,
    BASE_PATH=BASE_PATH,
    encoding=encoding,
    clean_headers=False,
)

Reading: PakistanSuicideAttacks Ver 6 (10-October-2017).csv
shape: (492, 26)
columns: ['S#', 'Date', 'Islamic Date', 'Blast Day Type', 'Holiday Type', 'Time', 'City', 'Latitude', 'Longitude', 'Province', 'Location', 'Location Category', 'Location Sensitivity', 'Open/Closed Space', 'Influencing Event/Event', 'Target Type', 'Targeted Sect if any', 'Killed Min', 'Killed Max', 'Injured Min', 'Injured Max', 'No. of Suicide Blasts', 'Explosive Weight (max)', 'Hospital Names', 'Temperature(C)', 'Temperature(F)']
pc_number_of_null_values: 0.174406

Reading: PakistanSuicideAttacks Ver 11 (30-November-2017).csv
shape: (496, 26)
columns: ['S#', 'Date', 'Islamic Date', 'Blast Day Type', 'Holiday Type', 'Time', 'City', 'Latitude', 'Longitude', 'Province', 'Location', 'Location Category', 'Location Sensitivity', 'Open/Closed Space', 'Influencing Event/Event', 'Target Type', 'Targeted Sect if any', 'Killed Min', 'Killed Max', 'Injured Min', 'Injured Max', 'No. of Suicide Blasts', 'Explosive Weight (m

## Preliminary text pre-processing
___

**TODO**

* Clean the `City` column of inconsistencies.
* Normalize the `City` column for letter case, whitespace, etc.

In [8]:
get_all_the_unique_values_in_the_city_column_sort_and_print(suicide_attacks)

['ATTOCK' 'Attock ' 'Bajaur Agency' 'Bannu' 'Bhakkar ' 'Buner' 'Chakwal '
 'Chaman' 'Charsadda' 'Charsadda ' 'D. I Khan' 'D.G Khan' 'D.G Khan '
 'D.I Khan' 'D.I Khan ' 'Dara Adam Khel' 'Dara Adam khel' 'Fateh Jang'
 'Ghallanai, Mohmand Agency ' 'Gujrat' 'Hangu' 'Haripur' 'Hayatabad'
 'Islamabad' 'Islamabad ' 'Jacobabad' 'KURRAM AGENCY' 'Karachi' 'Karachi '
 'Karak' 'Khanewal' 'Khuzdar' 'Khyber Agency' 'Khyber Agency ' 'Kohat'
 'Kohat ' 'Kuram Agency ' 'Lahore' 'Lahore ' 'Lakki Marwat' 'Lakki marwat'
 'Lasbela' 'Lower Dir' 'MULTAN' 'Malakand ' 'Mansehra' 'Mardan'
 'Mohmand Agency' 'Mohmand Agency ' 'Mohmand agency'
 'Mosal Kor, Mohmand Agency' 'Multan' 'Muzaffarabad' 'North Waziristan'
 'North waziristan' 'Nowshehra' 'Orakzai Agency' 'Peshawar' 'Peshawar '
 'Pishin' 'Poonch' 'Quetta' 'Quetta ' 'Rawalpindi' 'Sargodha'
 'Sehwan town' 'Shabqadar-Charsadda' 'Shangla ' 'Shikarpur' 'Sialkot'
 'South Waziristan' 'South waziristan' 'Sudhanoti' 'Sukkur' 'Swabi '
 'Swat' 'Swat ' 'Taftan' 'Tangi, 

In [9]:
# normalize_column does things like this:
# 'ATTOCK', 'Attock ' ---> 'attock'
# 'D. I Khan', 'D.I Khan ' ---> 'd.i khan'

suicide_attacks = normalize_column(suicide_attacks, 'City')

Casted to lowercase ✓
Stripped left and right whitespaces ✓
Stripped whitespaces after dots ✓
Cells with two entries were splitted, and second entry placed into a new column: second_City. ✓
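
`normalize_column` also lives in `functions.ipynb`. As an illustration only, the steps reported above can be sketched for a single cell value. The name `normalize_city_name` is hypothetical, and splitting on the first comma is an assumption about how two-entry cells are detected.

```python
import re


def normalize_city_name(raw):
    """Return (city, second_city) for one raw cell (hypothetical helper).

    second_city is None when the cell holds a single entry.
    """
    text = raw.lower().strip()          # lowercase, trim both ends
    text = re.sub(r"\.\s+", ".", text)  # 'd. i khan' -> 'd.i khan'
    # Assumption: a comma separates two entries in the same cell.
    first, _, second = text.partition(",")
    second = second.strip()
    return first.strip(), (second or None)


for value in ["ATTOCK", "D. I Khan", "Ghallanai, Mohmand Agency "]:
    print(repr(value), "->", normalize_city_name(value))
```

With pandas, such a function could be mapped over the column, e.g. `suicide_attacks['City'].map(normalize_city_name)`, and the resulting pairs unpacked into `City` and `second_City`.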


## Matching of inconsistent data entries
___

**TODO** 

* Verify there are no more inconsistencies in the `City` column.
* Feel free to use the [`fuzzywuzzy`](https://github.com/seatgeek/fuzzywuzzy) package to match and remove possible issues.

> **Fuzzy matching:** The process of automatically finding text strings that are very similar to the target string. In general, a string is considered "closer" to another one the fewer characters you'd need to change to transform one string into the other. So "apple" and "snapple" are two changes away from each other (add "s" and "n"), while "in" and "on" are one change away (replace "i" with "o"). You won't always be able to rely on fuzzy matching 100%, but it will usually end up saving you at least a little time.
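
The "number of changes" idea in the definition above corresponds to the Levenshtein edit distance. The sketch below is a plain-Python illustration of that count; `edit_distance` is a hypothetical name, and this is not how `fuzzywuzzy` computes its scores.

```python
def edit_distance(source, target):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn `source` into `target`."""
    # previous[j] = distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into '' takes i deletions
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # delete source_char
                current[j - 1] + 1,                   # insert target_char
                previous[j - 1] + substitution_cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


print(edit_distance("apple", "snapple"))  # 2
print(edit_distance("in", "on"))          # 1
```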

In [10]:
# I didn't have time to explore fuzzywuzzy, but will definitely have a look at it!
# In the meantime, we can visually check the output: most inconsistencies are gone,
# although 'kuram agency' and 'kurram agency' still look like the same place.
get_all_the_unique_values_in_the_city_column_sort_and_print(suicide_attacks)

['attock' 'bajaur agency' 'bannu' 'bhakkar' 'buner' 'chakwal' 'chaman'
 'charsadda' 'd.g khan' 'd.i khan' 'dara adam khel' 'fateh jang'
 'ghallanai' 'gujrat' 'hangu' 'haripur' 'hayatabad' 'islamabad'
 'jacobabad' 'karachi' 'karak' 'khanewal' 'khuzdar' 'khyber agency'
 'kohat' 'kuram agency' 'kurram agency' 'lahore' 'lakki marwat' 'lasbela'
 'lower dir' 'malakand' 'mansehra' 'mardan' 'mohmand agency' 'mosal kor'
 'multan' 'muzaffarabad' 'north waziristan' 'nowshehra' 'orakzai agency'
 'peshawar' 'pishin' 'poonch' 'quetta' 'rawalpindi' 'sargodha'
 'sehwan town' 'shabqadar-charsadda' 'shangla' 'shikarpur' 'sialkot'
 'south waziristan' 'sudhanoti' 'sukkur' 'swabi' 'swat' 'taftan' 'tangi'
 'tank' 'taunsa' 'tirah valley' 'totalai' 'upper dir' 'wagah' 'zhob']
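
A visual check gets harder as the list grows, and the output above still contains both 'kuram agency' and 'kurram agency'. As a rough sketch of an automated check, the snippet below flags near-identical pairs using the standard library's `difflib` in place of `fuzzywuzzy`. The function name `find_near_duplicates` and the 0.9 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_near_duplicates(values, threshold=0.9):
    """Return (a, b, score) for each pair of distinct values whose
    similarity ratio is at least `threshold` (hypothetical helper)."""
    pairs = []
    for a, b in combinations(sorted(set(values)), 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    # Most similar pairs first
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


cities = ['kuram agency', 'kurram agency', 'lahore', 'd.g khan', 'd.i khan']
for a, b, score in find_near_duplicates(cities):
    print(f'{a!r} ~ {b!r} (similarity {score})')
```

On this small sample only the 'kuram agency' / 'kurram agency' pair is reported. 'd.g khan' and 'd.i khan', which are distinct cities, score 0.875 and stay below the threshold, so any flagged pair still deserves a manual look before merging.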
