# Police Radio Analysis: Data Collection
*** 

### Contents:
- [Overview](#Overview)
- [Functions](#Functions)
- [Dataframe by Word](#Data-Collection:-Single-Word-Observation)
- [Dataframe by Sentence](#Data-Collection:-Sentence-by-Speaker-Observation)
- [Dataframe by Location (for Mapping)](#Data-Collection:-Threatened-Locations)

### Overview

- Audio Files: Broadcastify archives for 11/08/2018 between 6:44AM-9:12AM in Butte County, CA (The Camp Fire)


- Used Amazon Transcribe to transcribe audio to text (output JSON file)


- Steps to convert JSON files to a structured dataframes:
    1. Take word and speaker information from individual dataframe to create two dataframes
    2. Merge the two dataframes by assigning the speaker to the word based on the time
    3. Based on speaker and start/end times reconstruct sentences (new speaker signifies new observation)
    4. Repeat for every JSON file and combine all into on dataframe
    5. Add additional desired columns based on current information in dataframe:
        - Clean version of text (no punctuation, all lowercase)
        - Start and stop time as datetime objects using the initial start time in the file name
        - Indicator for containing 'fire' or 'evacuation' in the observations
    6. Sort dataframe by the start time (datetime object) and export
    7. Use existing dataframe and to make a new dataframe where each identified location is one observation


- Output: 3 Dataframes
    - Dataframe by Individual Words
    - Dataframe by Sentences
    - Dataframe by Locations Mentioned (for Mapping)

*Import Libraries*

In [1]:
import pandas as pd
import numpy as np
import geopandas as gpd
import json
import pytz
import glob
import re
from pandas.io.json import json_normalize
from datetime import datetime, timedelta

ModuleNotFoundError: No module named 'geopandas'

**NOTE: Requires Docker Environment for GeoPandas**

#### Load Geographical Data for Mapping

In [2]:
## Loads geographic data from https://data.ca.gov/dataset/ca-geographic-boundaries
ca_places = gpd.read_file('./data/CA_places/CA_Places_TIGER2016.shp')
ca_state = gpd.read_file('./data/CA_State/CA_State_TIGER2016.shp')
ca_counties = gpd.read_file('./data/CA_Counties/CA_Counties_TIGER2016.shp')

## Reset INTPTLAT & INTPTLON to floats
ca_places['INTPTLAT'] = ca_places['INTPTLAT'].astype('float')
ca_places['INTPTLON'] = ca_places['INTPTLON'].astype('float')
ca_state['INTPTLAT'] = ca_state['INTPTLAT'].astype('float')
ca_state['INTPTLON'] = ca_state['INTPTLON'].astype('float')
ca_counties['INTPTLAT'] = ca_counties['INTPTLAT'].astype('float')
ca_counties['INTPTLON'] = ca_counties['INTPTLON'].astype('float')

## Reset places dataframe to capital NAME
for i, row in enumerate(ca_places.iterrows()):
    ca_places.loc[i,'NAME'] = row[1]['NAME'].upper()

## Functions

In [3]:
# Get 2 dataframes (one for words, one for speakers) from JSON
def transcription_outputs(file_name):
    
    # Read in the JSON file
    with open('./' + file_name,'r') as read_file:
         data = json.load(read_file)
    
    # Get Dataframe 1: Words
    words = data['results']['items']
    words = json_normalize(words,
              record_path='alternatives',
              meta=['end_time','start_time','type'],
              errors='ignore')
    words = words[['content','confidence','start_time','end_time','type']]
    words['feed'] = data['jobName']

    # Get Dataframe 2: Speaker
    speaker_labels = data['results']['speaker_labels']
    speaker_turns = json_normalize(speaker_labels,
                      record_path='segments',
                      meta='speakers',
                      errors='ignore')
    speaker_turns = speaker_turns[['start_time', 'end_time', 'items', 'speaker_label', 'speakers']]
    
    return words, speaker_turns

In [4]:
# Append speaker to words dataframe based on time
def append_speaker(items_df, speakers_df):
    
    # Set default values
    items_df['speaker_start'] = np.nan
    items_df['speaker_end'] = np.nan
    items_df['sentence'] = np.nan
    items_df['speaker'] = np.nan
    
    # Reset index
    items_df.reset_index(inplace=True, drop=True) 
    
    # Format start time in both columns as type float
    items_df['start_time'] = items_df['start_time'].astype('float') ### MAY NOT NEED IF ALREADY SET
    speakers_df['start_time'] = speakers_df['start_time'].astype('float') ### MAY NOT NEED IF ALREADY SET
    
    # Iterate through rows in speaker df to append to items_df (dataframe of words)
    sentence = 0
    for i, sp_row in enumerate(speakers_df.iterrows()):
        start_time = sp_row[1][0] ### Make sure index locs are the same
        end_time = sp_row[1][1] ### Make sure index locs are the same
        speaker = sp_row[1][3] ### Make sure index locs are the same
        for i, it_row in enumerate(items_df[items_df['start_time'] >= start_time].iterrows()):
            if it_row[1][3] >= end_time: ### Make sure index locs are the same
                items_df.loc[it_row[0],'speaker'] = speaker
                items_df.loc[it_row[0],'speaker_start'] = start_time
                items_df.loc[it_row[0],'speaker_end'] = end_time
                items_df.loc[it_row[0],'sentence'] = sentence
                sentence += 1
                break
            else:
                items_df.loc[it_row[0],'speaker'] = speaker
                items_df.loc[it_row[0],'speaker_start'] = start_time
                items_df.loc[it_row[0],'speaker_end'] = end_time
                items_df.loc[it_row[0],'sentence'] = sentence
    
    # Fill in in values for the punctuation rows with the values from the previous text observation
    items_df['end_time'] = items_df['end_time'].map(lambda x: np.nan if x is 'nan' else float(x))
    items_df['speaker_end'] = items_df['speaker_end'].astype(float)
    for column in ['start_time', 'end_time', 'speaker_start', 'speaker_end', 'sentence', 'speaker']:
        items_df[column].fillna(method = 'ffill', inplace = True)

    # Fix confidence - replace None with np.nan and change values to float (from string)
    items_df['confidence'] = items_df['confidence'].map(lambda x: np.nan if x is None else float(x))
    
    return items_df

In [5]:
# Returns list of locations from our pre-defined list present in a single observation
def get_location(text):
    
    # Pre-defined location list
    location_list = ['CHICO', 'PARADISE', 'OROVILLE', 'MAGALIA', 'THERMALITO', 'GRIDLEY', 'DURHAM', 'PALERMO', 'RIDGE', 'BIGGS', 
                        'COHASSET', 'BERRY-CREEK', 'FOREST-RANCH', 'BUTTE-CREEK-CANYON', 'BUTTE-VALLEY', 'COHASSET', 'CONCOW', 'BANGOR', 
                        'HONCUT', 'YANKEE-HILL', 'FORBESTOWN', 'NORD', 'PUGLIA', 'STIRLING-CITY', 'RICHVALE', 'RACKERBY', 'BERRY-CREEK-RANCHERIA', 
                        'CLIPPER-MILLS', 'ROBINSON-MILL', 'CHEROKEE', 'BUTTE-MEADOWS', 'ENTERPRISE-RANCHERIA']
    
    # Get list of locatoins mentioned
    locations_mentioned = []
    for location in location_list:
        if location.lower() in text.lower():
            locations_mentioned.append(location.replace('-',' '))
    
    return locations_mentioned

In [6]:
# Maps a list of the potential location matches from the CA_places dataframe and appends lat & long data
def label_centroids(df, ca_places):
    df['INTPTLON'] = 'None'
    df['INTPTLAT'] = 'None'
    df['ID_PLACES'] = 'None'
    for i, row in enumerate(df.iterrows()):
        intptlon = []
        intptlat = []
        id_places = []
        for loc in row[1]['location']:
            if loc in ca_places['NAME'].tolist():
                intptlon.extend(ca_places[ca_places['NAME'] == loc]['INTPTLON'])
                intptlat.extend(ca_places[ca_places['NAME'] == loc]['INTPTLAT'])
                id_places.extend(ca_places[ca_places['NAME'] == loc]['NAME'])
        df.at[i,'INTPTLON'] = intptlon
        df.at[i,'INTPTLAT'] = intptlat
        df.at[i,'ID_PLACES'] = id_places
    return df

In [7]:
# Maps file name to actual feed names (5 different feeds used)
def remap_feed(filename_str):
    
    # Dictionary to map feed numbers to feed bames
    feeds = {
        '1929'  : "Butte_Sheriff_Fire__Paradise_Police",
        '22956' : "Chico_Paradise_Fire__CalFire",
        '25641' : "Chico_Police_Dispatch",
        '24574' : "Oroville_Fire",
        '26936' : "Oroville_Police_Fire"
    }
    f = filename_str
    code = f[f.rfind('-')+1:-1]
    
    return feeds[code]

In [8]:
# Indicate which department the feed is associated with (Fire or Police)
def police_fire(feed_name):
    
    # Feed name to uppercase
    fnu = feed_name.upper()
    
    # Look for 'Fire' and 'Police' in feed name
    if 'FIRE' in fnu and 'POLICE' in fnu:
        return 'BOTH'
    elif 'FIRE' in fnu:
        return 'FIRE'
    elif 'POLICE' in fnu:
        return 'POLICE'
    else:
        return 'FEED_NAME_ERROR'

In [9]:
# Changes start time to the actual time (datetime object) based on thes start time of the entire feed
def actual_time_str(timecode_str, filename_str):
    
    filename_str = filename_str.split('/')[-1]

    # Get year, month, day, hour and minutes from file name
    YYYY = filename_str[:4]
    MM = filename_str[4:6]
    DD = filename_str[6:8]
    hh = filename_str[8:10]
    mm = filename_str[10:12]
    ssssmmmm  = float(timecode_str) # added via timedelta
    
    # adjust hours (file name is 2 hours behind actual time)
    hh_adj = str(int(hh)-2).zfill(2)
   
    # Create date time object 
    dt_str = ''.join([YYYY,MM,DD,hh_adj,mm])
    dt_naive = datetime.strptime(dt_str, '%Y%m%d%H%M')
    tz_pac = pytz.timezone('US/Pacific')
    dt_pac_0 = tz_pac.localize(dt_naive)
    dt_pac_actual = dt_pac_0 + timedelta(seconds = int(ssssmmmm))
                                     
    return dt_pac_actual

In [10]:
# Take individual words and reconstruct into a sentence based on the time and speaker
def sentence_reconstruction(items_spkr_appnd, ca_places):
    
    # Create empty dataframe with desired columns
    df = pd.DataFrame(data='', index=[0], columns=['text','speaker_start','speaker_end','speaker_length','speaker', 
                                                   'sentence', 'word_confidence','avg_confidence','min_conf','feed'])

    # Iterate through items dataframe to construct sentences
    index_set = 0
    confidence = []
    text = ''
    previous_speaker = items_spkr_appnd['speaker'][0]
    for i, speaker in enumerate(items_spkr_appnd['speaker']):
        if speaker != previous_speaker:
            df.loc[index_set,:] = {
                'speaker_start' : items_spkr_appnd.loc[i,'speaker_start'],
                'speaker_end' : items_spkr_appnd.loc[i,'speaker_end'],
                'speaker' : items_spkr_appnd.loc[i,'speaker'],
                'sentence' : items_spkr_appnd.loc[i,'sentence'],
                'text' : text,
                'word_confidence' : confidence,
                'min_conf' : min(confidence),
                'feed' : items_spkr_appnd.loc[i,'feed']
            }
            index_set += 1
            text = ''
            confidence = []
        text += str(items_spkr_appnd.loc[i, 'content'] + ' ')
        confidence.append(items_spkr_appnd.loc[i,'confidence'])
        previous_speaker = speaker
    
    # Calculate average confidence by removing NAs
    df['avg_confidence'] = df['word_confidence'].map(lambda list: np.mean([x for x in list if str(x) != 'nan']))
    
    # Add speaker_length
    df['speaker_length'] = df['speaker_end'] - df['speaker_start']
    
    # Add new columns: location, state, feed_name and is_fire_dept
    df['location'] = df['text'].map(get_location)
    df['state'] = 'CA'
    df['feed_name'] = df['feed'].map(remap_feed)
    df['department'] = df['feed_name'].map(police_fire)
    df = label_centroids(df, ca_places)
    
    return df

In [11]:
# Utilize previous functions to get master dataframe in the desired format
def get_dataframe(file_name, ca_places):
    
    # File name
    file_name = file_name

    # Step 1: get items and speaker df from json
    items_df, speakers_df = transcription_outputs(file_name)
                                                     
    # Step 2: combine dataframes (single word observations with speaker)
    df = append_speaker(items_df, speakers_df)

    # Step 3: observations by sentence and additional desired columns
    df = sentence_reconstruction(df, ca_places)
    
    return df

In [12]:
# Create dataframe based on location to be used for location mapping
def create_threat_df(df):
    threat_df = pd.DataFrame(columns=['latitude','longitude'])
    index = 0
    for i, row in enumerate(df.iterrows()):
        if len(row[1]['INTPTLAT']) > 0:
            for j, loc in enumerate(row[1]['INTPTLAT']):
                threat_df.loc[index,'latitude'] = row[1]['INTPTLAT'][j]
                threat_df.loc[index,'longitude'] = row[1]['INTPTLON'][j]
                threat_df.loc[index,'id_places'] = row[1]['ID_PLACES'][j]
                threat_df.loc[index,'text'] = row[1]['text_clean']
                threat_df.loc[index,'confidence'] = row[1]['avg_confidence']
                threat_df.loc[index,'feed'] = row[1]['feed_name']
                threat_df.loc[index,'start_time'] = row[1]['start_time']
                threat_df.loc[index,'end_time'] = row[1]['end_time']
                threat_df.loc[index,'department'] = row[1]['department']
                index += 1
    return threat_df

## Data Collection: Single Word Observation

In [13]:
# Get word dataframe

# Use glob to get list of all json files in the folder
files_json = (glob.glob('./data/translations/*.json'))

# Create empty dataframe
words = pd.DataFrame()
rows = 0 

for i, file_name in enumerate(files_json):
    word_one_file, _ = transcription_outputs(file_name)
    
    rows += len(word_one_file)
    
    # Print status
    print(f'{i+1} of {len(files_json)} Dataframes Completed ({rows} rows): {file_name}')
    
    # Add each dataframe together
    words = pd.concat([words, word_one_file])
    
# Reset index of master dataframe
words.reset_index(drop=True, inplace=True)

# Print shape of dataframe after going through all JSON files
print(f'Words Dataframe Shape: {words.shape}')

1 of 20 Dataframes Completed (558 rows): ./data/translations/201811081001-444704-22956_.json
2 of 20 Dataframes Completed (1146 rows): ./data/translations/201811080929-467022-25641_.json
3 of 20 Dataframes Completed (1201 rows): ./data/translations/201811080858-659667-24574_.json
4 of 20 Dataframes Completed (2511 rows): ./data/translations/201811080931-763045-22956_.json
5 of 20 Dataframes Completed (3921 rows): ./data/translations/201811080901-584135-22956_.json
6 of 20 Dataframes Completed (4113 rows): ./data/translations/201811081011-319947-26936_.json
7 of 20 Dataframes Completed (4253 rows): ./data/translations/201811080911-136992-26936_.json
8 of 20 Dataframes Completed (5080 rows): ./data/translations/201811081012-237044-1929_.json
9 of 20 Dataframes Completed (5143 rows): ./data/translations/201811080959-402082-25641_.json
10 of 20 Dataframes Completed (5208 rows): ./data/translations/201811080928-650127-24574_.json
11 of 20 Dataframes Completed (5637 rows): ./data/translation

In [14]:
words.head()

Unnamed: 0,content,confidence,start_time,end_time,type,feed
0,wait,0.3086,38.04,39.28,pronunciation,201811081001-444704-22956_
1,FIRE,0.9876,39.29,39.88,pronunciation,201811081001-444704-22956_
2,with,1.0,39.88,40.64,pronunciation,201811081001-444704-22956_
3,every,0.1804,41.22,41.41,pronunciation,201811081001-444704-22956_
4,night,0.2812,41.41,41.72,pronunciation,201811081001-444704-22956_


In [15]:
# Save words dataframe

# Export: save as csv
words.to_csv('./data/words.csv', index=False)

# Export: save as pkl
words.to_pickle('./data/words.pkl')

## Data Collection: Sentence by Speaker Observation

In [17]:
# Get sentence dataframe

# Use glob to get list of all json files in the folder
files_json = (glob.glob('./data/translations/*.json'))

# Create empty dataframe
sentence = pd.DataFrame()
rows = 0 

# Iterate through the files and get a dataframe for each file
for i, file_name in enumerate(files_json):
    
    # Get dataframe for an individual file
    df_one_file = get_dataframe(file_name, ca_places)
    rows += len(df_one_file)
    
    # Print status
    print(f'{i+1} of {len(files_json)} Dataframes Completed (new rows: {len(df_one_file)}, total rows: {rows}): {file_name}')
    
    # Add each dataframe together
    sentence = pd.concat([sentence, df_one_file])

# Reset index of master dataframe
sentence.reset_index(drop=True, inplace=True)

# Print shape of dataframe after going through all JSON files
print(f'Sentence Dataframe Shape: {sentence.shape}')

1 of 20 Dataframes Completed (new rows: 52, total rows: 52): ./data/translations/201811081001-444704-22956_.json
2 of 20 Dataframes Completed (new rows: 44, total rows: 96): ./data/translations/201811080929-467022-25641_.json
3 of 20 Dataframes Completed (new rows: 6, total rows: 102): ./data/translations/201811080858-659667-24574_.json
4 of 20 Dataframes Completed (new rows: 50, total rows: 152): ./data/translations/201811080931-763045-22956_.json
5 of 20 Dataframes Completed (new rows: 30, total rows: 182): ./data/translations/201811080901-584135-22956_.json
6 of 20 Dataframes Completed (new rows: 10, total rows: 192): ./data/translations/201811081011-319947-26936_.json
7 of 20 Dataframes Completed (new rows: 7, total rows: 199): ./data/translations/201811080911-136992-26936_.json
8 of 20 Dataframes Completed (new rows: 45, total rows: 244): ./data/translations/201811081012-237044-1929_.json
9 of 20 Dataframes Completed (new rows: 2, total rows: 246): ./data/translations/201811080959

#### Add Additional Columns

In [18]:
# Add column for clean text (remove punctuation and make all lowercase)
# Reference: Code adapted from NLP_EDA-InClass in DEN Flex by Sam Stack
def clean_text(raw_text):
    words = re.sub(r'[^a-z0-9]', r' ', raw_text.lower()).split()
    return ' '.join(words)

sentence['text_clean'] = sentence['text'].apply(clean_text)

In [19]:
# Add columns for start and end time as datetime objects (using speaker_start/end and file name)
start_time = []
end_time = []

for i in range(len(sentence)):
    start_time.append(actual_time_str(sentence['speaker_start'][i], sentence['feed'][i]))
    end_time.append(actual_time_str(sentence['speaker_end'][i], sentence['feed'][i]))

sentence['start_time'] = start_time
sentence['end_time'] = end_time

In [20]:
# Add columns that indicate if fire or evacuation related words were mentioned in that observation
sentence['contains_fire'] = sentence['text_clean'].map(lambda x: 1 if 'fire' in x else 0)
sentence['contains_evac'] = sentence['text_clean'].map(lambda x: 1 if 'evac' in x else 0)

In [21]:
# Look at dataframe
sentence.head(3)

Unnamed: 0,text,speaker_start,speaker_end,speaker_length,speaker,sentence,word_confidence,avg_confidence,min_conf,feed,...,feed_name,department,INTPTLON,INTPTLAT,ID_PLACES,text_clean,start_time,end_time,contains_fire,contains_evac
0,wait FIRE with every night .,52.94,53.39,0.45,spk_0,4,"[0.3086, 0.9876, 1.0, 0.1804, 0.2812, nan]",0.55156,0.1804,201811081001-444704-22956_,...,Chico_Paradise_Fire__CalFire,FIRE,[],[],[],wait fire with every night,2018-11-08 08:01:52-08:00,2018-11-08 08:01:53-08:00,1,0
1,Thirty,53.39,53.7,0.31,spk_2,5,[0.6417],0.6417,0.6417,201811081001-444704-22956_,...,Chico_Paradise_Fire__CalFire,FIRE,[],[],[],thirty,2018-11-08 08:01:53-08:00,2018-11-08 08:01:53-08:00,0,0
2,nine quarters . Affirmative evacuation order i...,78.64,78.85,0.21,spk_0,12,"[0.9994, 0.5192, nan, 0.8876, 0.9996, 0.8254, ...",0.8239,0.2566,201811081001-444704-22956_,...,Chico_Paradise_Fire__CalFire,FIRE,[],[],[],nine quarters affirmative evacuation order is ...,2018-11-08 08:02:18-08:00,2018-11-08 08:02:18-08:00,0,1


In [22]:
# Verify results of start and end time to datatime objects
print(sentence[['start_time']].min())
print(sentence[['start_time']].max())
print(sentence[['end_time']].max())

start_time   2018-11-08 06:44:31-08:00
dtype: datetime64[ns, US/Pacific]
start_time   2018-11-08 09:12:25-08:00
dtype: datetime64[ns, US/Pacific]
end_time   2018-11-08 09:12:26-08:00
dtype: datetime64[ns, US/Pacific]


In [23]:
# Sort dataframe by start time
sentence.sort_values(by = 'start_time', inplace = True)

# Reset index
sentence.reset_index(drop=True, inplace=True)

In [24]:
# Verify changes
print('Sentence Shape: ', sentence.shape)
sentence.index

Sentence Shape:  (464, 22)


RangeIndex(start=0, stop=464, step=1)

In [25]:
# Save words dataframe

# Export: save as csv
sentence.to_csv('./data/sentence.csv', index=False)

# Export: save as pkl
sentence.to_pickle('./data/sentence.pkl')

## Data Collection: Threatened Locations

In [26]:
# Creates a dataframe from the sentence dataframe that only includes the locations mentioned/threatened
threat = create_threat_df(sentence)

In [27]:
threat.head()

Unnamed: 0,latitude,longitude,id_places,text,confidence,feed,start_time,end_time,department
0,39.7542,-121.606,PARADISE,justin maguire is clear and counting down from...,0.806101,Oroville_Police_Fire,2018-11-08 06:59:15-08:00,2018-11-08 06:59:17-08:00,BOTH
1,37.4829,-118.602,PARADISE,justin maguire is clear and counting down from...,0.806101,Oroville_Police_Fire,2018-11-08 06:59:15-08:00,2018-11-08 06:59:17-08:00,BOTH
2,39.4955,-121.56,OROVILLE,left thirty eleven the one you re with and thi...,0.784975,Chico_Paradise_Fire__CalFire,2018-11-08 07:04:34-08:00,2018-11-08 07:04:34-08:00,FIRE
3,39.6315,-121.405,BERRY CREEK,left thirty eleven the one you re with and thi...,0.784975,Chico_Paradise_Fire__CalFire,2018-11-08 07:04:34-08:00,2018-11-08 07:04:34-08:00,FIRE
4,39.7703,-121.513,CONCOW,left thirty eleven the one you re with and thi...,0.784975,Chico_Paradise_Fire__CalFire,2018-11-08 07:04:34-08:00,2018-11-08 07:04:34-08:00,FIRE


In [28]:
# Export: save as csv
threat.to_csv('./data/threat.csv', index=False)

# Export: save as pkl
threat.to_pickle('./data/threat.pkl')