# Text Analytics with 3-1-1 Call Data
> An attempt at text analytics using non-emergency service calls.

- toc: true 
- badges: true
- comments: true
- categories: [jupyter]
- image: images/cincy_skyline.jpg

## Flow
- overview of TF-IDF
- remove stop words
- look at index of % services by department cut by neighborhood/time to showcase how we learn more from text analytics
- look at top words overall (these might be automated responses, if so remove)
- run tf-idf by zip code for set time period
     - explore differences between Hyde Park and Avondale
- run tf-idf by month for all neighborhoods over set time period
     - explore differences between a single month per season


# About

Text analytics is a growing set of tools for uncovering insights in texts of interest.  In industry, this may include mining customer reviews.

These tools are powerful and let us 'listen' to people in a programmatic way.  Text analytics has obvious limitations compared to the understanding that comes from an actual human reading the text, but for processing large amounts of data it can be effective within those limits.

## In this Post

This notebook will demonstrate various text mining techniques, with our subject being 3-1-1 call data.  3-1-1 Call Centers are common in municipalities and represent a non-emergency hotline for residents to contact.  Incidents reported include fallen limbs, potholes, overgrown footprints, and much more. 

Text mining was a new topic for me.  I have been exposed to trainings in the past but have never wandered into the wild lands of raw text descriptions and tried to make better sense of them programmatically.  This ended up being a fun exercise with a ton of text cleaning and I learned a lot - I hope you do as well!

## Why is this Important

Querying and accessing data is a fundamental step of any data science workflow.  The mechanics of accessing data can get very messy, especially when it comes to government data.  Examples of this messiness include manually downloading files and saving to a user-defined location, appending multiple Excel files, scraping PDFs, and the list goes on.  Bottom line: it can get MESSY.

Luckily, the infrastructure provided by Socrata makes for a seamless experience that can be replicated by anyone with the Internet and Python.  You read that correctly: everything demonstrated in this post can be replicated with very little setup.  This is a huge perk for the sake of collaboration.  Additionally, in the event that an analysis needs to be re-run, perhaps on more recent data, the ability to query data programmatically allows for minimal rework and minimal room for error.  Alright, enough hype, let's get to it!

## Setup

Before we get rolling, it is important to be working in an environment with the necessary packages installed and available.  For this post we will need the `pandas`, `numpy`, `plotly` and `scikit-learn` libraries installed (`datetime` ships with Python).  As mentioned before, I like using conda to manage dependencies and would encourage others to go that route.

In [1]:
import pandas as pd
import datetime as dt
import numpy as np
import plotly.express as px

pd.options.display.max_colwidth = 100

# function takes three arguments: the endpoint URL, the SoQL query text, and a row limit
# function returns a full, URL-encoded API call
def generate_query(endpoint_url, query, limit):

    raw_query = (f"{endpoint_url}?$query="
                 f"{query}%20"
                 f"limit {limit}"
                )
    
    # encode spaces and newlines so the result is a valid URL
    for replacements in ((" ", "%20"), ("\n", "%20")):
        raw_query = raw_query.replace(*replacements)
    
    return raw_query
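
To make the helper's output concrete, here is a small sketch that rebuilds it and prints the URL for a toy query.  The query text and the limit of 5 are illustrative assumptions; the endpoint is the one used later in this post.

```python
def generate_query(endpoint_url, query, limit):
    # Same logic as the helper above: append the SoQL query and a limit clause,
    # then encode spaces and newlines as %20.
    raw_query = f"{endpoint_url}?$query={query}%20limit {limit}"
    for old, new in ((" ", "%20"), ("\n", "%20")):
        raw_query = raw_query.replace(old, new)
    return raw_query


endpoint = "https://data.cincinnati-oh.gov/resource/4cjh-bm8b.json"

# Illustrative query only: count requests per service code.
example_url = generate_query(
    endpoint,
    "select service_code, count(*) as n group by service_code",
    5,
)
print(example_url)
```

Note that only spaces and newlines are encoded, so an indented multi-line query simply turns into longer runs of `%20`.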


## Initial Read

We are going to read in the past year of 3-1-1 call data for this exploration.

In [2]:
# define our base API endpoint
endpoint_311 = 'https://data.cincinnati-oh.gov/resource/4cjh-bm8b.json'

# dynamically generate today's date
today = dt.date.today()

# dynamically arrive at the date exactly 365 days ago
year_ago_today = today - dt.timedelta(days = 365)

Before building any queries, take a quick look at the first few records straight from the endpoint.

In [3]:
pd.read_json(endpoint_311).head(5)

Unnamed: 0,jurisdiction_id,service_request_id,status,service_name,service_code,description,agency_responsible,requested_datetime,updated_datetime,expected_datetime,address,zipcode,latitude,longitude,requested_date,updated_date,last_table_update
0,CINCINNATI,SR14009513,CLOS,"""Metal Furniture, Spec Collectn""","""MTL-FRN""","""THere will also be a vacuum cleaner, and carpet cleaner for pick up""",Public Services,2014-02-05T23:21:00Z,2014-02-28T00:00:00Z,2014-02-19T00:00:00Z,"""4601 CRAWFORD AV, CINC - GJ1298238864""",45223,39.172947,-84.534519,2014-02-05T00:00:00.000,2014-02-28T00:00:00.000,2015-03-05T23:07:49.000
1,CINCINNATI,SR14009514,CLOS,"""Sign, street sign faded""","""STSGN""","""southwest corner of intersection""",Public Services,2014-02-05T23:21:00Z,2014-02-13T00:00:00Z,2014-02-06T00:00:00Z,"""MAIN ST & WOODWARD ST""",45202,39.110518,-84.511968,2014-02-05T00:00:00.000,2014-02-13T00:00:00.000,2015-03-05T23:07:49.000
2,CINCINNATI,SR14009515,CLOS,"""Slippery streets, request""","""SLPYST""","""Request entered through the Web. Refer to Intake Questions for further description.""",Public Services,2014-02-05T23:23:00Z,2014-03-07T00:00:00Z,2014-02-06T00:00:00Z,"""2779 MORNINGRIDGE DR, CINC - GJ0839534632""",45211,39.135951,-84.588884,2014-02-05T00:00:00.000,2014-03-07T00:00:00.000,2015-03-05T23:07:49.000
3,CINCINNATI,SR14009516,CLOS,"""Property damage, traffic aids""","""PRDMTAID""","""Transfer: 02/06/2014 6:43 AM/DCOTTRELLPlease check, mailbox was distroyed by snow plow - Reques...",Public Services,2014-02-05T23:24:00Z,2014-03-27T00:00:00Z,2014-02-20T00:00:00Z,"""4601 CRAWFORD AV, CINC - GJ1298238864""",45223,39.172947,-84.534519,2014-02-05T00:00:00.000,2014-03-27T00:00:00.000,2015-03-05T23:07:49.000
4,CINCINNATI,SR14009517,CLOS,"""Graffiti, removal""","""GRFITI""","""Request entered through the Web. Refer to Intake Questions for further description.""",Public Services,2014-02-05T23:28:00Z,2014-02-13T00:00:00Z,2014-03-07T00:00:00Z,"""1227 MAIN ST, CINC - GJ1512631874""",45202,39.109491,-84.511949,2014-02-05T00:00:00.000,2014-02-13T00:00:00.000,2015-03-05T23:07:49.000


Aggregate total counts of requests by `agency_responsible` for the past year.

In [4]:
total_agency_responsible_query = (
    
    generate_query(endpoint_url = endpoint_311, 
                   query = f"""select agency_responsible, 
                                      count(*) as n
                               where requested_datetime>='{year_ago_today}' 
                               and requested_datetime<='{today}'
                               group by agency_responsible""", 
                   limit = 100000000)
)

Grouping by `agency_responsible`, we see that the bulk of our records (about 73%) belong to 'Public Services', with the rest spread across a handful of other departments.  In light of this, let's investigate only the Public Services records and break them down by these mysterious `service_code` values.

In [5]:
agency_responsible_count = pd.read_json(total_agency_responsible_query)

agency_responsible_count \
    .assign(total_requests=lambda x: np.sum(x['n']),
            pct_total=lambda x: x['n'] / x['total_requests']) \
    .sort_values(by='pct_total', ascending = False).head(10)

Unnamed: 0,agency_responsible,n,total_requests,pct_total
13,Public Services,69303,95332,0.726965
4,Cinc Building Dept,5585,95332,0.058585
1,City Manager's Office,5271,95332,0.055291
2,Police Department,4611,95332,0.048368
6,Cinc Health Dept,4560,95332,0.047833
15,Dept of Trans and Eng,3514,95332,0.036861
16,Park Department,748,95332,0.007846
8,Cin Water Works,493,95332,0.005171
9,Fire Dept,413,95332,0.004332
3,Metropolitan Sewer,343,95332,0.003598


#### Filter to Public Services only and aggregate the counts across the entire city

In [6]:
total_services_query = (
    
generate_query(endpoint_url = endpoint_311, 
               query = f"""select service_code, 
                                  service_name, 
                                  count(*) as n 
                           where requested_datetime>='{year_ago_today}' 
                           and requested_datetime<='{today}' 
                           and agency_responsible == 'Public Services'
                           group by service_code, 
                                    service_name""", 
               limit = 10000000)

)

In [7]:
total_services = pd.read_json(total_services_query) \
    .assign(total_n=lambda x: np.sum(x['n']),
            pct_total=lambda x: x['n'] / x['total_n']) \
    .sort_values(by = 'pct_total', ascending = False)

total_services.head(10)

Unnamed: 0,service_code,service_name,n,total_n,pct_total
46,"""MTL-FRN""","""Metal Furniture, Spec Collectn""",24921,69303,0.359595
77,"""RF-COLLT""","""Trash, request for collection""",5244,69303,0.075668
85,"""YDWSTA-J""","""Yard waste,rtc""",3435,69303,0.049565
70,"""LITR-PRV""","""Litter, private property""",3394,69303,0.048973
36,"""PTHOLE""","""Pothole, repair""",3074,69303,0.044356
99,"""TLGR-PRV""","""Tall grass/weeds, private prop""",2970,69303,0.042855
79,"""SLPYST""","""Slippery streets, request""",2709,69303,0.039089
63,"""TGGDCLLC""","""Trash, tagged collections""",2268,69303,0.032726
45,"""TRSHCRTR""","""Trash cart, registration""",2161,69303,0.031182
67,"""STRSGN""","""Sign, down/missing """,1469,69303,0.021197


In [8]:
zipcode_services_query = (
    
generate_query(endpoint_url = endpoint_311, 
               query = f"""select zipcode,
                                  service_code, 
                                  service_name, 
                                  count(*) as n
                           where requested_datetime>='{year_ago_today}' 
                           and requested_datetime<='{today}' 
                           and agency_responsible == 'Public Services'
                           group by zipcode,
                                    service_code, 
                                    service_name""", 
               limit = 100000000)

)

In [9]:
zipcode_services = pd.read_json(zipcode_services_query) \
    .groupby('zipcode') \
    .apply(lambda x: x.assign(total_n=np.sum(x['n']))) \
    .assign(pct_total=lambda x: x['n'] / x['total_n'])


zipcode_services \
    .sort_values(by='pct_total', ascending = False).head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,zipcode,service_code,service_name,n,total_n,pct_total
zipcode,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
45002.0,1606,45002.0,"""SLPYST""","""Slippery streets, request""",1,1,1.0
45249.0,1512,45249.0,"""SLPYST""","""Slippery streets, request""",1,1,1.0
45248.0,1119,45248.0,"""MDDVSWNO""","""Media advis, winter operations""",1,1,1.0
45244.0,713,45244.0,"""PTHOLE""","""Pothole, repair""",1,1,1.0
45230.0,700,45230.0,"""MTL-FRN""","""Metal Furniture, Spec Collectn""",1173,2314,0.506914
45251.0,658,45251.0,"""DMGNOD""","""Damage Claim - NOD""",1,2,0.5
45247.0,894,45247.0,"""DMGNOD""","""Damage Claim - NOD""",1,2,0.5
45247.0,864,45247.0,"""SVCCMPLT""","""Service complaint, trash""",1,2,0.5
45236.0,241,45236.0,"""RF-COLLT""","""Trash, request for collection""",3,6,0.5
45251.0,1798,45251.0,"""SVCCMPLT""","""Service complaint, trash""",1,2,0.5


In [11]:
zipcode_services.merge(total_services, on=['service_code','service_name'], how='left') \
    .assign(service_index=lambda x: x['pct_total_x'] / x['pct_total_y']) \
    .query("n_x > 20") \
    .sort_values(by='service_index', ascending=False) \
    .query("zipcode == 45202") \
    .head(20)

Unnamed: 0,zipcode,service_code,service_name,n_x,total_n_x,pct_total_x,n_y,total_n_y,pct_total_y,service_index
13,45202.0,"""COVID_19""","""General Inquiry""",27,2591,0.010421,29,69303,0.000418,24.902927
63,45202.0,"""GRFITI""","""Graffiti, removal""",100,2591,0.038595,476,69303,0.006868,5.619241
15,45202.0,"""CRNRCNOF""","""Corner can, overflowing""",43,2591,0.016596,326,69303,0.004704,3.528056
78,45202.0,"""SCLEN1""","""Street cleaning""",165,2591,0.063682,1419,69303,0.020475,3.110185
61,45202.0,"""TRSHRQNS""","""Trash, request for new service""",23,2591,0.008877,211,69303,0.003045,2.915614
27,45202.0,"""TLGR-PS""","""Tall grass/weeds, PS property""",40,2591,0.015438,478,69303,0.006897,2.238292
2,45202.0,"""DUMP-PVS""","""Dumping, prv prop <2500 sq ft""",73,2591,0.028174,983,69303,0.014184,1.986342
45,45202.0,"""STRSGN""","""Sign, down/missing """,90,2591,0.034736,1469,69303,0.021197,1.638722
39,45202.0,"""LITR-PRV""","""Litter, private property""",204,2591,0.078734,3394,69303,0.048973,1.607692
11,45202.0,"""RWFRNTRT""","""ROW furniture/trash dumping""",37,2591,0.01428,616,69303,0.008889,1.606592
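
The `service_index` column is a ratio of two shares: how much of a zip code's requests a service accounts for, divided by how much of the citywide requests it accounts for.  A value above 1 means the service is over-represented in that zip code.  As a worked example, the sketch below recomputes the graffiti row from the table above; the function name `service_index` is just a label chosen for this illustration.

```python
def service_index(zip_count, zip_total, city_count, city_total):
    """Share of a service within a zip code divided by its citywide share."""
    zip_share = zip_count / zip_total
    city_share = city_count / city_total
    return zip_share / city_share


# Graffiti removal in 45202: 100 of 2,591 requests locally,
# 476 of 69,303 requests citywide (values taken from the table above).
graffiti_index = service_index(100, 2591, 476, 69303)
print(round(graffiti_index, 6))
```

Because the index is a ratio of small shares, it gets noisy for rare services, which is why the cell above keeps only rows with more than 20 requests.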


In [12]:
raw_descriptions_query = (
    
generate_query(endpoint_url = endpoint_311, 
               query = f"""select zipcode,
                                  requested_datetime,
                                  service_code,
                                  service_name,
                                  description
                            where requested_datetime>='{year_ago_today}' 
                            and requested_datetime<='{today}' 
                            and agency_responsible == 'Public Services'""",
               limit = 100000000000)

)

In [13]:
# strip everything except letters and whitespace (digits, punctuation, quotation marks)
# collapse double spaces and lowercase the text
# filter out boilerplate-only and blank description records
raw_311 = pd.read_json(raw_descriptions_query) \
    .assign(description_clean = lambda x: x['description'].str.replace(r'[^a-zA-Z\s]', '', regex=True)
                                                          .str.replace('"', '', regex=False)
                                                          .str.replace('  ', ' ', regex=False)
                                                          .str.lower(),
            service_code = lambda x: x['service_code'].str.replace('"', '', regex=False),
            service_name = lambda x: x['service_name'].str.replace('"', '', regex=False)) \
    .query("description_clean != 'request entered through the web refer to intake questions for further description'") \
    .query("description_clean != ' '")



In [14]:
raw_311.head(5)

Unnamed: 0,zipcode,requested_datetime,service_code,service_name,description,description_clean
7,45238.0,2020-11-10T07:25:00Z,RF-COLLT,"Trash, request for collection","""TRASH RTC """,trash rtc
11,45207.0,2020-11-10T07:49:00Z,TGGDCLLC,"Trash, tagged collections","""TAGGED @ 0748am TOOK TOTER & (3) BAGS/ITEMS - LEFT THE REST""",tagged am took toter bagsitems left the rest
12,45227.0,2020-11-10T07:55:00Z,BLKYTMST,"Special collections, rtc","""- REFRIGERATOR/FREEZER - TWIN BED MATTRESS - BOX SPRING""",refrigeratorfreezer twin bed mattress box spring
14,45225.0,2020-11-10T07:58:00Z,CRNRCNOF,"Corner can, overflowing","""CORNER CAN OVERFLOWING \nHOUSEHOLD TRASH IS BEING PUT IN THE CAN CITIZEN WOULD LIKE TO SEE IF I...",corner can overflowing \nhousehold trash is being put in the can citizen would like to see if it...
15,45207.0,2020-11-10T08:00:00Z,YDWSTA-J,"Yard waste,rtc","""Transfer: 11/10/2020 10:53 AM/DJOHNSON1\nI got 4-5 of leaves bags that need to get pick up. \n ...",transfer amdjohnson\ni got of leaves bags that need to get pick up \n request entered through t...
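
To see what the cleaning step does to a single record, here is a minimal sketch using the standard `re` module on one of the descriptions shown above.  The first function mirrors the chain of replacements in the notebook; the second is a possible variation (my assumption, not what the notebook runs) that collapses any run of whitespace, including newlines, into a single space.

```python
import re


def clean_like_notebook(text):
    # Keep letters and whitespace only, then drop quotes,
    # replace literal double spaces once, and lowercase.
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    text = text.replace('"', "")
    text = text.replace("  ", " ")
    return text.lower()


def clean_collapsing_whitespace(text):
    # Variation: collapse every run of whitespace (spaces, tabs, newlines).
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip().lower()


sample = '"TAGGED @ 0748am TOOK TOTER & (3) BAGS/ITEMS - LEFT THE REST"'

print(repr(clean_like_notebook(sample)))
print(repr(clean_collapsing_whitespace(sample)))
```

The single `'  '` to `' '` replacement leaves a double space wherever three or more spaces appeared in a row, and it does not touch newlines.  Neither matters much for TF-IDF, since the vectorizer tokenizes on word boundaries, but it is worth knowing when comparing cleaned strings for equality.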


In [15]:
# concatenate all strings within the same zipcode
zipcode_descriptions = raw_311.groupby(['zipcode'])['description_clean'].apply(' '.join).reset_index()
zipcode_descriptions.head(5)

Unnamed: 0,zipcode,description_clean
0,45002.0,citizen request the street be salted and treated immediately
1,45202.0,transfer amccolemanentire block was missed yard waste has been out for over a week request ente...
2,45203.0,bolts and debris in the road way tagged am took toter bagsitems left the rest tagged am took t...
3,45204.0,tires and trash bags rtc trash cart \nentire street pu at neff ave pu at neff\n amsgoodwin ou...
4,45205.0,special collections yard waste rtc cans are being left out to block parking spots and the side...


In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(input = 'content', stop_words='english', smooth_idf = True)

tfidf_vector = tfidf_vectorizer.fit_transform(zipcode_descriptions['description_clean'])
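
As a quick overview of what the vectorizer computes: with its default settings and `smooth_idf = True`, scikit-learn documents the score as the raw term count times `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term, followed by L2 normalization of each document's vector.  The sketch below reproduces that arithmetic in plain Python on three made-up "zip code" documents.  It is a simplification: it splits on whitespace and skips stop-word removal, and the toy texts and the `tfidf` function name are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Smoothed TF-IDF with L2 normalization for a dict of {name: text}."""
    n_docs = len(docs)
    counts = {name: Counter(text.split()) for name, text in docs.items()}

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for c in counts.values() for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    scores = {}
    for name, c in counts.items():
        raw = {t: tf * idf[t] for t, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        scores[name] = {t: v / norm for t, v in raw.items()}
    return scores


# Hypothetical documents, one per zip code.
toy_docs = {
    "zip_a": "salt truck salt street",
    "zip_b": "graffiti street wall",
    "zip_c": "trash street trash cart",
}

scores = tfidf(toy_docs)
for name, vec in scores.items():
    top_term = max(vec, key=vec.get)
    print(name, top_term, round(vec[top_term], 3))
```

Terms that appear in every document (here, "street") get the lowest weight, while terms concentrated in one document rise to the top.  One side effect of the normalization is that a document containing a single distinct term scores exactly 1.0 for it, so zip codes with very little text can dominate a ranking by raw score.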


In [17]:
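# get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names()
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), 
                        index=zipcode_descriptions['zipcode'], 
                        columns=tfidf_vectorizer.get_feature_names_out()).reset_index()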

tfidf_df

Unnamed: 0,zipcode,aa,aaron,aas,ab,abandoned,abanoned,abatement,abgle,abigail,...,yw,zier,zinsle,zip,zips,zone,zoning,zoo,zoom,zula
0,45002.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,45202.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.009486,...,0.026812,0.0,0.0,0.0,0.0,0.006195,0.0,0.0,0.009486,0.0
2,45203.0,0.0,0.028588,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.019985,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,45204.0,0.0,0.0,0.0,0.0,0.003075,0.0,0.0,0.0,0.0,...,0.026174,0.0,0.0,0.0,0.0,0.004187,0.0,0.0,0.0,0.0
4,45205.0,0.0,0.0,0.0,0.00236,0.002519,0.0,0.0,0.00236,0.0,...,0.030521,0.0,0.0,0.0,0.0,0.0,0.00788,0.0,0.0,0.0
5,45206.0,0.0,0.0,0.0,0.0,0.002424,0.0,0.0,0.0,0.0,...,0.042862,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,45207.0,0.0,0.0,0.0,0.0,0.003393,0.0,0.0,0.0,0.0,...,0.019994,0.0,0.0,0.0,0.0,0.00462,0.0,0.0,0.0,0.0
7,45208.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.076858,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,45209.0,0.007848,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.043888,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,45211.0,0.0,0.0,0.0,0.0,0.002938,0.0,0.0,0.0,0.0,...,0.045207,0.0,0.0,0.0,0.0,0.002,0.0,0.0,0.0,0.0


In [18]:
tfidf_df \
    .melt(id_vars = 'zipcode', value_vars = tfidf_df.columns) \
    .query("variable not in ('trash','rtc')") \
    .sort_values(by = 'value', ascending = False).head(40)

Unnamed: 0,zipcode,variable,value
243085,45249.0,salt,1.0
267492,45236.0,sunset,0.633035
135837,45002.0,immediately,0.605926
189888,45248.0,parrish,0.601534
142542,45248.0,jill,0.601534
117888,45236.0,glenway,0.599353
139344,45248.0,interview,0.497185
243087,45002.0,salted,0.475674
37708,45244.0,beechmont,0.449277
250219,45244.0,shift,0.373148


In [19]:
tfidf_df \
    .melt(id_vars = 'zipcode', value_vars = tfidf_df.columns) \
    .query("variable not in ('trash','rtc')") \
    .query("zipcode == 45202") \
    .sort_values(by = 'value', ascending = False).head(40)

Unnamed: 0,zipcode,variable,value
280801,45202.0,transfer,0.297219
232285,45202.0,request,0.256988
226123,45202.0,refer,0.249098
220741,45202.0,questions,0.246267
264460,45202.0,street,0.245364
295660,45202.0,web,0.243437
97735,45202.0,entered,0.243437
83461,45202.0,description,0.243437
138802,45202.0,intake,0.243437
169339,45202.0,missed,0.155903
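
Most of the top terms for 45202 (request, refer, questions, web, entered, description, intake) spell out the web-intake boilerplate sentence.  That suggests the sentence survives as part of longer descriptions, since the earlier filter only drops rows that match it exactly.  One possible follow-up, sketched here on plain strings rather than on the DataFrame, is to strip the phrase wherever it occurs before vectorizing.  The sample descriptions and the `strip_boilerplate` helper are hypothetical.

```python
import re

BOILERPLATE = (
    "request entered through the web refer to intake questions "
    "for further description"
)


def strip_boilerplate(text, phrase=BOILERPLATE):
    # Remove the phrase anywhere in the text, then tidy leftover whitespace.
    without = text.replace(phrase, " ")
    return re.sub(r"\s+", " ", without).strip()


samples = [
    "entire block was missed yard waste has been out for over a week " + BOILERPLATE,
    BOILERPLATE,
    "pothole in front of driveway",
]

cleaned = [strip_boilerplate(s) for s in samples]
kept = [c for c in cleaned if c]  # drop records that were boilerplate only
print(kept)
```

In the notebook itself, the same idea could be applied to `description_clean` with `.str.replace(BOILERPLATE, ' ', regex=False)` ahead of the blank-record filter.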


In [20]:
raw_311.groupby('description').agg('count').sort_values(by = 'zipcode', ascending = False).head(20)

Unnamed: 0_level_0,zipcode,requested_datetime,service_code,service_name,description_clean
description,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
"""SPECIAL COLLECTION """,1389,1389,1389,1389,1389
"""RTC""",459,459,459,459,459
"""SPECIAL COLLECTIONS """,409,409,409,409,409
"""POTHOLE""",178,178,178,178,178
"""REQUESTING SET OF STICKERS (1)""",165,165,165,165,165
"""RTC- ENTIRE STREET MISSED""",135,135,135,135,135
"""MISSED TRASH COLLECTION """,133,133,133,133,133
"""RTC- CANS""",125,125,125,125,125
"""REQUESTING SALT TRUCK""",112,112,112,112,112
"""Bags""",103,103,103,103,103
