# Text Analytics with 3-1-1 Call Data
> An attempt at text analytics using non-emergency service calls.

- toc: true 
- badges: true
- comments: true
- categories: [jupyter]
- image: images/cincy_skyline.jpg

## Flow
- overview of TF-IDF
- remove stop words
- look at index of % services by department cut by neighborhood/time to showcase how we learn more from text analytics
- look at top words overall (these might be automated responses, if so remove)
- run tf-idf by neighborhood for set time period
     - explore differences between Hyde Park and Avondale
- run tf-idf by month for all neighborhoods over set time period
     - explore differences between a single month per season


# About

Text analytics is a growing set of tools for uncovering insights in text.  In industry, this may include mining customer reviews.

These tools let us 'listen' to people in a programmatic way.  Text analytics cannot match the understanding that comes from an actual human reading the text, but for processing large amounts of data it can be effective within those limits.

## In this Post

This notebook will demonstrate various text mining techniques, with our subject being 3-1-1 call data.  3-1-1 Call Centers are common in municipalities and represent a non-emergency hotline for residents to contact.  Incidents reported include fallen limbs, potholes, overgrown footprints, and much more. 

Text mining was a new topic for me.  I have been exposed to trainings in the past but have never wandered into the wild lands of raw text descriptions and tried to make better sense of them programmatically.  This ended up being a fun exercise with a ton of text cleaning and I learned a lot - I hope you do as well!

## Why is this Important

Querying and accessing data is a fundamental step of any data science workflow.  The mechanics of accessing data can get very messy, especially when it comes to government data.  Examples of this messiness include manually downloading files and saving to a user-defined location, appending multiple Excel files, scraping PDFs, and the list goes on.  Bottom line: it can get MESSY.

Luckily, the infrastructure provided by Socrata makes for a seamless experience that can be replicated by anyone with the Internet and Python.  You read that correctly: everything demonstrated in this post can be replicated with very little setup.  This is a huge perk for the sake of collaboration.  Additionally, in the event that an analysis needs to be re-run, perhaps on more recent data, the ability to query data programmatically allows for minimal rework and minimal room for error.  Alright, enough hype, let's get to it!

## Setup

Before we get rolling, it is important to be working in an environment with the necessary packages installed and available.  For this post we will need the `pandas`, `numpy`, `plotly`, `scikit-learn`, and `nltk` libraries installed (`datetime` ships with Python).  As mentioned before, I like using conda to manage dependencies and would encourage others to go that route.

In [1]:
import pandas as pd
import datetime as dt
import numpy as np
import plotly.express as px

pd.options.display.max_colwidth = 100

# function takes three arguments (endpoint_url, the SoQL query text, and a row limit)
# function returns a full, cleaned up API call
def generate_query(endpoint_url, query, limit):

    raw_query = (f"{endpoint_url}?$query="
                 f"{query}%20"
                 f"limit {limit}"
                )
    
    # get rid of control characters
    for replacements in ((" ", "%20"), ("\n", "%20")):
        raw_query = raw_query.replace(*replacements)
    
    return raw_query
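
To see what the helper produces, here is a minimal sketch that repeats the function and feeds it a short illustrative query (the `select`/`where` text below is a made-up example, not one of the queries used later in the post). Spaces and newlines are swapped for `%20` so the result can be handed straight to `pd.read_json`.

```python
def generate_query(endpoint_url, query, limit):
    # same logic as the helper above, condensed
    raw_query = f"{endpoint_url}?$query={query}%20limit {limit}"
    for old, new in ((" ", "%20"), ("\n", "%20")):
        raw_query = raw_query.replace(old, new)
    return raw_query


example_url = generate_query(
    endpoint_url="https://data.cincinnati-oh.gov/resource/4cjh-bm8b.json",
    query="select service_code\nwhere status='CLOS'",  # illustrative query only
    limit=5,
)
print(example_url)
```

Note that only spaces and newlines are escaped here. A query containing other reserved characters (an `&` or a literal `%`, say) would likely need fuller encoding, for example with `urllib.parse.quote`.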


## Initial Read

We are going to read in the past year of 3-1-1 call data for this exploration.

In [2]:
# define our base API endpoint
endpoint_311 = 'https://data.cincinnati-oh.gov/resource/4cjh-bm8b.json'

# dynamically generate today's date
today = dt.date.today()

# dynamically arrive at the date exactly 365 days ago
year_ago_today = today - dt.timedelta(days = 365)
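
Two details about this window are easy to miss: a `datetime.date` renders as `YYYY-MM-DD` when interpolated into an f-string, and subtracting 365 days is not quite a calendar year when a leap day falls inside the window. A small sketch with a fixed, illustrative date (the names `fixed_today` and `window_start` are hypothetical):

```python
import datetime as dt

fixed_today = dt.date(2021, 1, 1)                    # illustrative date, not from the post
window_start = fixed_today - dt.timedelta(days=365)

# dates format as ISO strings when dropped into an f-string
clause = f"requested_datetime>='{window_start}'"
print(clause)  # requested_datetime>='2020-01-02' (2020 was a leap year)
```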

Before building any queries, take a quick look at the first few records the endpoint returns by default.

In [3]:
pd.read_json(endpoint_311).head(5)

Unnamed: 0,jurisdiction_id,service_request_id,status,service_name,service_code,description,agency_responsible,requested_datetime,updated_datetime,expected_datetime,address,zipcode,latitude,longitude,requested_date,updated_date,last_table_update
0,CINCINNATI,SR14009513,CLOS,"""Metal Furniture, Spec Collectn""","""MTL-FRN""","""THere will also be a vacuum cleaner, and carpet cleaner for pick up""",Public Services,2014-02-05T23:21:00Z,2014-02-28T00:00:00Z,2014-02-19T00:00:00Z,"""4601 CRAWFORD AV, CINC - GJ1298238864""",45223,39.172947,-84.534519,2014-02-05T00:00:00.000,2014-02-28T00:00:00.000,2015-03-05T23:07:49.000
1,CINCINNATI,SR14009514,CLOS,"""Sign, street sign faded""","""STSGN""","""southwest corner of intersection""",Public Services,2014-02-05T23:21:00Z,2014-02-13T00:00:00Z,2014-02-06T00:00:00Z,"""MAIN ST & WOODWARD ST""",45202,39.110518,-84.511968,2014-02-05T00:00:00.000,2014-02-13T00:00:00.000,2015-03-05T23:07:49.000
2,CINCINNATI,SR14009515,CLOS,"""Slippery streets, request""","""SLPYST""","""Request entered through the Web. Refer to Intake Questions for further description.""",Public Services,2014-02-05T23:23:00Z,2014-03-07T00:00:00Z,2014-02-06T00:00:00Z,"""2779 MORNINGRIDGE DR, CINC - GJ0839534632""",45211,39.135951,-84.588884,2014-02-05T00:00:00.000,2014-03-07T00:00:00.000,2015-03-05T23:07:49.000
3,CINCINNATI,SR14009516,CLOS,"""Property damage, traffic aids""","""PRDMTAID""","""Transfer: 02/06/2014 6:43 AM/DCOTTRELLPlease check, mailbox was distroyed by snow plow - Reques...",Public Services,2014-02-05T23:24:00Z,2014-03-27T00:00:00Z,2014-02-20T00:00:00Z,"""4601 CRAWFORD AV, CINC - GJ1298238864""",45223,39.172947,-84.534519,2014-02-05T00:00:00.000,2014-03-27T00:00:00.000,2015-03-05T23:07:49.000
4,CINCINNATI,SR14009517,CLOS,"""Graffiti, removal""","""GRFITI""","""Request entered through the Web. Refer to Intake Questions for further description.""",Public Services,2014-02-05T23:28:00Z,2014-02-13T00:00:00Z,2014-03-07T00:00:00Z,"""1227 MAIN ST, CINC - GJ1512631874""",45202,39.109491,-84.511949,2014-02-05T00:00:00.000,2014-02-13T00:00:00.000,2015-03-05T23:07:49.000


Aggregate total counts of requests by `agency_responsible` for the past year.

In [161]:
total_agency_responsible_query = (
    
    generate_query(endpoint_url = endpoint_311, 
                   query = f"""select agency_responsible, 
                                      count(*) as n
                               where requested_datetime>='{year_ago_today}' 
                               and requested_datetime<='{today}'
                               group by agency_responsible""", 
                   limit = 100000000)
)

Grouping by `agency_responsible`, we see that the bulk of our records fall under 'Public Services', with a handful of other departments making up the rest.  In light of this, let's investigate only the Public Services records and break them down by these mysterious `service_code` values.

In [162]:
agency_responsible_count = pd.read_json(total_agency_responsible_query)

agency_responsible_count \
    .assign(total_requests=lambda x: np.sum(x['n']),
            pct_total=lambda x: x['n'] / x['total_requests']) \
    .sort_values(by='pct_total', ascending = False).head(10)

Unnamed: 0,agency_responsible,n,total_requests,pct_total
13,Public Services,74969,103266,0.72598
4,Cinc Building Dept,6047,103266,0.058558
1,City Manager's Office,5695,103266,0.055149
6,Cinc Health Dept,5057,103266,0.048971
2,Police Department,4932,103266,0.04776
15,Dept of Trans and Eng,3866,103266,0.037437
16,Park Department,786,103266,0.007611
8,Cin Water Works,540,103266,0.005229
9,Fire Dept,468,103266,0.004532
3,Metropolitan Sewer,381,103266,0.00369


#### Filter to Public Services only and aggregate the counts across the entire city

In [163]:
total_services_query = (
    
generate_query(endpoint_url = endpoint_311, 
               query = f"""select service_code, 
                                  service_name, 
                                  count(*) as n 
                           where requested_datetime>='{year_ago_today}' 
                           and requested_datetime<='{today}' 
                           and agency_responsible == 'Public Services'
                           group by service_code, 
                                    service_name""", 
               limit = 10000000)

)

In [164]:
total_services = pd.read_json(total_services_query) \
    .assign(total_n=lambda x: np.sum(x['n']),
            pct_total=lambda x: x['n'] / x['total_n']) \
    .sort_values(by = 'pct_total', ascending = False)

total_services.head(10)

Unnamed: 0,service_code,service_name,n,total_n,pct_total
46,"""MTL-FRN""","""Metal Furniture, Spec Collectn""",27088,74969,0.361323
77,"""RF-COLLT""","""Trash, request for collection""",5757,74969,0.076792
85,"""YDWSTA-J""","""Yard waste,rtc""",3744,74969,0.049941
70,"""LITR-PRV""","""Litter, private property""",3643,74969,0.048593
36,"""PTHOLE""","""Pothole, repair""",3173,74969,0.042324
99,"""TLGR-PRV""","""Tall grass/weeds, private prop""",3061,74969,0.04083
79,"""SLPYST""","""Slippery streets, request""",2712,74969,0.036175
63,"""TGGDCLLC""","""Trash, tagged collections""",2495,74969,0.03328
45,"""TRSHCRTR""","""Trash cart, registration""",2349,74969,0.031333
41,"""TIRES""","""Tires, Special Collection""",1640,74969,0.021876


In [165]:
zipcode_services_query = (
    
generate_query(endpoint_url = endpoint_311, 
               query = f"""select zipcode,
                                  service_code, 
                                  service_name, 
                                  count(*) as n
                           where requested_datetime>='{year_ago_today}' 
                           and requested_datetime<='{today}' 
                           and agency_responsible == 'Public Services'
                           group by zipcode,
                                    service_code, 
                                    service_name""", 
               limit = 100000000)

)

In [168]:
zipcode_services = pd.read_json(zipcode_services_query) \
    .groupby('zipcode') \
    .apply(lambda x: x.assign(total_n=np.sum(x['n']))) \
    .assign(pct_total=lambda x: x['n'] / x['total_n'])


zipcode_services \
    .sort_values(by='pct_total', ascending = False).head(10)

zipcode (group index),row index,zipcode,service_code,service_name,n,total_n,pct_total
45002.0,1648,45002.0,"""SLPYST""","""Slippery streets, request""",1,1,1.0
45249.0,1551,45249.0,"""SLPYST""","""Slippery streets, request""",1,1,1.0
45244.0,731,45244.0,"""PTHOLE""","""Pothole, repair""",1,1,1.0
45241.0,389,45241.0,"""TRSHREMV""","""Trash cart, remove""",1,1,1.0
45236.0,248,45236.0,"""RF-COLLT""","""Trash, request for collection""",4,7,0.571429
45230.0,718,45230.0,"""MTL-FRN""","""Metal Furniture, Spec Collectn""",1273,2511,0.506969
45251.0,676,45251.0,"""DMGNOD""","""Damage Claim - NOD""",1,2,0.5
45248.0,1854,45248.0,"""DAPUB1""","""Dead animal""",1,2,0.5
45248.0,1150,45248.0,"""MDDVSWNO""","""Media advis, winter operations""",1,2,0.5
45247.0,916,45247.0,"""DMGNOD""","""Damage Claim - NOD""",1,2,0.5
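
The `groupby(...).apply(...)` pattern above leaves `zipcode` both in the index and as a column. One alternative, sketched here on a small made-up frame (the `toy` data is hypothetical), is `transform('sum')`, which broadcasts each group's total back onto its own rows and keeps a flat index.

```python
import pandas as pd

# hypothetical counts, shaped like the zipcode/service query result
toy = pd.DataFrame({
    "zipcode": [45202, 45202, 45230],
    "service_code": ["GRFITI", "PTHOLE", "MTL-FRN"],
    "n": [3, 1, 5],
})

shares = toy.assign(
    # per-zipcode total, repeated on every row of that zipcode
    total_n=lambda x: x.groupby("zipcode")["n"].transform("sum"),
    pct_total=lambda x: x["n"] / x["total_n"],
)
print(shares)
```

The toy frame also shows why the top of the sorted output is dominated by shares of 1.0: a ZIP code with a single request will always have that service at 100%.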


In [169]:
zipcode_services.merge(total_services, on=['service_code','service_name'], how='left') \
    .assign(service_index=lambda x: x['pct_total_x'] / x['pct_total_y']) \
    .query("n_x > 20") \
    .sort_values(by='service_index', ascending=False) \
    .query("zipcode == 45202") \
    .head(20)

Unnamed: 0,zipcode,service_code,service_name,n_x,total_n_x,pct_total_x,n_y,total_n_y,pct_total_y,service_index
13,45202.0,"""COVID_19""","""General Inquiry""",29,2823,0.010273,31,74969,0.000414,24.843178
64,45202.0,"""GRFITI""","""Graffiti, removal""",115,2823,0.040737,526,74969,0.007016,5.806079
15,45202.0,"""CRNRCNOF""","""Corner can, overflowing""",48,2823,0.017003,357,74969,0.004762,3.570622
62,45202.0,"""TRSHRQNS""","""Trash, request for new service""",26,2823,0.00921,237,74969,0.003161,2.913371
79,45202.0,"""SCLEN1""","""Street cleaning""",168,2823,0.059511,1636,74969,0.021822,2.727073
27,45202.0,"""TLGR-PS""","""Tall grass/weeds, PS property""",42,2823,0.014878,495,74969,0.006603,2.253279
2,45202.0,"""DUMP-PVS""","""Dumping, prv prop <2500 sq ft""",82,2823,0.029047,1073,74969,0.014313,2.029481
11,45202.0,"""RWFRNTRT""","""ROW furniture/trash dumping""",47,2823,0.016649,692,74969,0.00923,1.803693
46,45202.0,"""STRSGN""","""Sign, down/missing """,99,2823,0.035069,1560,74969,0.020809,1.685316
39,45202.0,"""LITR-PRV""","""Litter, private property""",226,2823,0.080057,3643,74969,0.048593,1.64748
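
The `service_index` column is the share of a ZIP code's requests that fall under a service, divided by that service's citywide share, so values above 1 mean the service is over-represented locally. As a sanity check, the graffiti row for 45202 can be recomputed by hand from the counts in the table:

```python
zip_n, zip_total = 115, 2823        # GRFITI requests / all requests in 45202
city_n, city_total = 526, 74969     # GRFITI requests / all Public Services requests

zip_share = zip_n / zip_total       # pct_total_x in the merged frame
city_share = city_n / city_total    # pct_total_y in the merged frame
service_index = zip_share / city_share

print(round(zip_share, 6), round(city_share, 6), round(service_index, 3))
```

Graffiti requests make up roughly 4% of 45202's volume versus about 0.7% citywide, which is where the index of about 5.8 comes from.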


In [170]:
raw_descriptions_query = (
    
generate_query(endpoint_url = endpoint_311, 
               query = f"""select zipcode,
                                  requested_datetime,
                                  service_code,
                                  service_name,
                                  description
                            where requested_datetime>='{year_ago_today}' 
                            and requested_datetime<='{today}' 
                            and agency_responsible == 'Public Services'""",
               limit = 100000000000)

)

In [171]:
# strip everything except letters and whitespace
# (regex=True is passed explicitly; newer pandas treats the pattern literally otherwise)
# get rid of quotation marks in the string
# filter out blank and web-form boilerplate description records
raw_311 = pd.read_json(raw_descriptions_query) \
    .assign(description_clean = lambda x: x['description'].str.replace(r'[^a-zA-Z\s]', '', regex=True)
                                                          .str.replace('"', '')
                                                          .str.replace('  ', ' ')
                                                          .str.lower(),
            service_code = lambda x: x['service_code'].str.replace('"', ''),
            service_name = lambda x: x['service_name'].str.replace('"', '')) \
    .query("description_clean != 'request entered through the web refer to intake questions for further description'") \
    .query("description_clean != ' '")


In [172]:
raw_311.head(5)

Unnamed: 0,zipcode,requested_datetime,service_code,service_name,description,description_clean
9,45205.0,2020-10-08T07:17:00Z,DAPUB1,Dead animal,"""DEAD ANIMAL RACOON ON W 8TH NEAR SUNSET IN CURB LANE""",dead animal racoon on w th near sunset in curb lane
11,45224.0,2020-10-08T07:25:00Z,TGGDCLLC,"Trash, tagged collections","""TRASH NOT OUT @ 0709am""",trash not out am
13,45224.0,2020-10-08T07:26:00Z,TGGDCLLC,"Trash, tagged collections","""TAGGED @ 0709am TOOK TOTER & (3) BAGS/ITEMS - LEFT THE REST""",tagged am took toter bagsitems left the rest
14,45223.0,2020-10-08T07:28:00Z,RF-COLLT,"Trash, request for collection","""TRASH NOT OUT @ 0715am\n10/08/2020 3:09 PM/RMCCRAY - LSO- CART AT THE CURB\n10/12/2020 4:52 PM/...",trash not out am\n pmrmccray lso cart at the curb\n pmsgoodwin trash still not collected
15,45239.0,2020-10-08T07:28:00Z,DAPUB1,Dead animal,"""DEAD RACOON IN WHITE TRASH BAG AT THE CURB""",dead racoon in white trash bag at the curb
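
To make the cleaning step concrete, here is a minimal standalone sketch using the standard library's `re` module on two of the descriptions shown above. The helper name `clean_description` is hypothetical, and it collapses any run of whitespace (including the `\n` that survives in the output above), which is slightly more aggressive than the double-space replacement in the notebook.

```python
import re


def clean_description(text):
    # keep only letters and whitespace, dropping digits, quotes and punctuation
    letters_only = re.sub(r"[^a-zA-Z\s]", "", text)
    # collapse runs of whitespace (spaces, newlines) into a single space
    collapsed = re.sub(r"\s+", " ", letters_only).strip()
    return collapsed.lower()


print(clean_description('"TRASH NOT OUT @ 0709am"'))
print(clean_description('"DEAD RACOON IN WHITE TRASH BAG AT THE CURB"'))
```

One side effect visible in both versions: stripping digits also strips street numbers, so "W 8TH" becomes "w th".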


In [173]:
# concatenate all strings within the same zipcode
zipcode_descriptions = raw_311.groupby(['zipcode'])['description_clean'].apply(' '.join).reset_index()

In [174]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(input = 'content', stop_words='english', smooth_idf = True)

tfidf_vector = tfidf_vectorizer.fit_transform(zipcode_descriptions['description_clean'])
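
For intuition about what the vectorizer computes, here is a rough pure-Python sketch of TF-IDF. It assumes the smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalisation that scikit-learn documents as `TfidfVectorizer` defaults. It is not the library's implementation: tokenisation is a plain `split()`, there is no stop-word removal, and the three "documents" are made up.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # document frequency: how many documents contain each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))  # L2 norm of the row
        rows.append({t: w / norm for t, w in raw.items()})
    return rows


# hypothetical stand-ins for three concatenated zipcode descriptions
docs = ["trash pickup missed", "trash pothole", "salt"]
for row in tfidf(docs):
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the second document, "pothole" outweighs "trash" because "trash" also appears elsewhere. The third document has a single term, so after normalisation its weight is 1, which is worth keeping in mind for ZIP codes with only one or two requests.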


In [175]:
# get_feature_names_out() replaces get_feature_names() on scikit-learn >= 1.0
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), 
                        index=zipcode_descriptions['zipcode'], 
                        columns=tfidf_vectorizer.get_feature_names_out()).reset_index()

tfidf_df

Unnamed: 0,zipcode,aa,aaron,aas,ab,abandoned,abanoned,abatement,abgle,abigail,...,yw,zier,zinsle,zip,zips,zone,zoning,zoo,zoom,zula
0,45002.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,45202.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.008696,...,0.02906,0.0,0.0,0.0,0.0,0.005697,0.0,0.0,0.008696,0.0
2,45203.0,0.0,0.027097,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.019183,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,45204.0,0.0,0.0,0.0,0.0,0.002889,0.0,0.0,0.0,0.0,...,0.026668,0.0,0.0,0.0,0.0,0.003921,0.0,0.0,0.0,0.0
4,45205.0,0.0,0.0,0.0,0.002162,0.002322,0.0,0.0,0.002162,0.0,...,0.035966,0.0,0.0,0.0,0.0,0.0,0.007213,0.0,0.0,0.0
5,45206.0,0.0,0.0,0.0,0.0,0.002292,0.0,0.0,0.0,0.0,...,0.045321,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,45207.0,0.0,0.0,0.0,0.0,0.003162,0.0,0.0,0.0,0.0,...,0.029188,0.0,0.0,0.0,0.0,0.004292,0.0,0.0,0.0,0.0
7,45208.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.084167,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,45209.0,0.007376,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.041774,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,45211.0,0.0,0.0,0.0,0.0,0.002752,0.0,0.0,0.0,0.0,...,0.045355,0.0,0.0,0.0,0.0,0.003735,0.0,0.0,0.0,0.0


In [177]:
tfidf_df \
    .melt(id_vars = 'zipcode') \
    .sort_values(by = 'value', ascending = False).head(40)

Unnamed: 0,zipcode,variable,value
261078,45249.0,salt,1.0
287310,45236.0,sunset,0.634326
145600,45002.0,immediately,0.603177
152837,45248.0,jill,0.601069
204077,45248.0,parrish,0.601069
126390,45236.0,glenway,0.60086
110354,45241.0,facility,0.540586
283394,45241.0,storage,0.497566
149357,45248.0,interview,0.497441
301905,45229.0,trash,0.489658
