In [1]:
import pandas as pd
import sqlalchemy as sa
import psycopg2 as ps
from sqlalchemy import create_engine

In [2]:
%load_ext sql
%sql postgresql://postgres:lingga28@localhost:2828/datacamp
conn = create_engine('postgresql://postgres:lingga28@localhost:2828/datacamp')

# 1. Count the categories
### Exercises
In this chapter, we'll be working mostly with the Evanston 311 data in table evanston311. This is data on help requests submitted to the city of Evanston, IL.

This data has several character columns. Start by examining the most frequent values in some of these columns to get familiar with the common categories.

### task 1
### Instruction
How many rows does each priority level have?

In [3]:
%%sql

-- Select the count of each level of priority
SELECT priority, count(*)
  FROM evanston311
 GROUP BY priority;

 * postgresql://postgres:***@localhost:2828/datacamp
4 rows affected.


priority,count
MEDIUM,5745
NONE,30081
HIGH,88
LOW,517


### task 2
### Instruction
How many distinct values of zip appear in at least 100 rows?

In [4]:
%%sql

-- Find values of zip that appear in at least 100 rows
-- Also get the count of each value
SELECT zip, count(*)
  FROM evanston311
 GROUP BY zip
HAVING count(*) >=100; 

 * postgresql://postgres:***@localhost:2828/datacamp
4 rows affected.


zip,count
60201.0,19054
,5528
60202.0,11165
60208.0,255


### task 3
### Instruction
How many distinct values of source appear in at least 100 rows?

In [5]:
%%sql

-- Find values of source that appear in at least 100 rows
-- Also get the count of each value
SELECT source, count(*)
  FROM evanston311
 GROUP BY source
HAVING count(*) >=100;

 * postgresql://postgres:***@localhost:2828/datacamp
4 rows affected.


source,count
gov.publicstuff.com,30985
Android,444
Iframe,3670
iOS,1199


### task 4
### Instruction
Select the five most common values of street and the count of each.

In [7]:
%%sql

-- Find the 5 most common values of street and the count of each
SELECT street, count(*)
  FROM evanston311
 GROUP BY street
 ORDER BY count(*) DESC
 LIMIT 5;

 * postgresql://postgres:***@localhost:2828/datacamp
5 rows affected.


street,count
,1699
Chicago Avenue,1440
Sherman Avenue,1276
Central Street,1211
Davis Street,1154
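
The four queries above share one pattern: group, count, then filter or sort on the count. As a rough, self-contained sketch, the same pattern can be tried on a few made-up rows, with Python's built-in sqlite3 module standing in for PostgreSQL. The table `requests` and its values are hypothetical and are not taken from evanston311. Rows with a NULL zip fall into a single group, which is why a blank zip or street appears with its own count in the results above.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption for illustration).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE requests (zip TEXT)")

# Hypothetical toy data: three zip values plus two missing (NULL) entries.
rows = [("60201",)] * 4 + [("60202",)] * 3 + [(None,)] * 2 + [("60208",)]
con.executemany("INSERT INTO requests VALUES (?)", rows)

# HAVING filters groups after counting; WHERE would filter rows before counting.
frequent = con.execute(
    """
    SELECT zip, count(*) AS n
      FROM requests
     GROUP BY zip
    HAVING count(*) >= 2
     ORDER BY n DESC
    """
).fetchall()

# ORDER BY ... LIMIT keeps only the most common value.
top = con.execute(
    "SELECT zip, count(*) AS n FROM requests GROUP BY zip ORDER BY n DESC LIMIT 1"
).fetchall()

print(frequent)
print(top)
```

The NULL group is returned as `None` on the Python side, and the single-row zip is dropped by the HAVING clause.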


# 2. Spotting character data problems
### Exercises
Explore the distinct values of the street column. Select each street value and the count of the number of rows with that value. Sort the results by street to see similar values near each other.

Look at the results.

Which of the following is NOT an issue you see with the values of street?

### Instructions
### Possible Answers
- A. The street suffix (e.g. Street, Avenue) is sometimes abbreviated
- B. There are sometimes extra spaces at the beginning and end of values
- C. House/street numbers sometimes appear in the column
- D. Capitalization is not consistent across values
- E. All of the above are potential problems

Answer: B

# 3. Trimming
### Exercises
Some of the street values in evanston311 include house numbers with # or / in them. In addition, some street values end in a period (.).

Remove the house numbers, extra punctuation, and any spaces from the beginning and end of the street values as a first attempt at cleaning up the values.

### Instructions
- Trim digits 0-9, #, /, ., and spaces from the beginning and end of street.
- Select distinct original street value and the corrected street value.
- Order the results by the original street value.

In [9]:
%%sql

SELECT distinct street,
       -- Trim off unwanted characters from street
       trim(street, '0123456789 # / .') AS cleaned_street
  FROM evanston311
 ORDER BY street
    LIMIT 10; --just an addition, so that the table is not elongated

 * postgresql://postgres:***@localhost:2828/datacamp
10 rows affected.


street,cleaned_street
1/2 Chicago Ave,Chicago Ave
1047B Chicago Ave,B Chicago Ave
13th Street,th Street
141A Callan Ave,A Callan Ave
141b Callan Ave,b Callan Ave
1624B Central St,B Central St
217A Dodge Ave,A Dodge Ave
221c Dodge Ave,c Dodge Ave
300c Dodge Ave,c Dodge Ave
3314A Central St,A Central St
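
The leftover fragments such as `th Street` and `B Chicago Ave` follow from how trimming with a character set works: characters are removed from each end only until the first character outside the set is reached. Python's `str.strip(chars)` behaves the same way, so a short sketch can reproduce the first three rows above. This is an analogy in Python, not the PostgreSQL implementation.

```python
# Same character set as in the trim() call: digits, space, #, / and .
UNWANTED = "0123456789 #/."

# The first three street values from the result above.
streets = ["1/2 Chicago Ave", "1047B Chicago Ave", "13th Street"]

# strip(chars) removes leading and trailing characters found in chars,
# stopping at the first character that is not in the set.
cleaned = [s.strip(UNWANTED) for s in streets]

for original, result in zip(streets, cleaned):
    print(f"{original!r} -> {result!r}")
```

Because letters are not in the set, the `B` in `1047B` and the `th` in `13th` stop the trimming, which is why this is only a first attempt at cleaning.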


# 4. Exploring unstructured text
### Exercises
The description column of evanston311 has the details of the inquiry, while the category column groups inquiries into different types. How well does the category capture what's in the description?

LIKE and ILIKE queries will help you find relevant descriptions and categories. Remember that with LIKE queries, you can include a % on each side of a word to find values that contain the word. For example:

SELECT category\
  FROM evanston311\
 WHERE category LIKE '%Taxi%';

% matches 0 or more characters.
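
To make the wildcard rules concrete, here is a minimal Python sketch that approximates LIKE and ILIKE by translating a pattern into a regular expression. The helper `like` and the sample value `Taxi Complaint` are hypothetical, and the sketch ignores escape characters, so treat it as an illustration of the matching rules rather than how PostgreSQL evaluates them.

```python
import re

def like(value, pattern, ignore_case=False):
    """Approximate SQL LIKE (or ILIKE when ignore_case=True) with a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # % matches zero or more characters
        elif ch == "_":
            parts.append(".")            # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    flags = re.DOTALL | (re.IGNORECASE if ignore_case else 0)
    # LIKE must match the whole value, hence fullmatch.
    return re.fullmatch("".join(parts), value, flags) is not None

starts_with_word = like("Taxi Complaint", "%Taxi%")                  # % can match nothing
wrong_case = like("taxi complaint", "%Taxi%")                        # LIKE is case sensitive
any_case = like("taxi complaint", "%Taxi%", ignore_case=True)        # ILIKE ignores case

print(starts_with_word, wrong_case, any_case)
```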

Building up the query through the steps below, find inquiries that mention trash or garbage in the description without trash or garbage being in the category. What are the most frequent categories for such inquiries?

### task 1
### Instruction
Use ILIKE to count rows in evanston311 where the description contains 'trash' or 'garbage' regardless of case.

In [10]:
%%sql

-- Count rows
SELECT count(*)
  FROM evanston311
 -- Where description includes trash or garbage
 WHERE description ILIKE '%trash%'
    OR description ILIKE '%garbage%';

 * postgresql://postgres:***@localhost:2828/datacamp
1 rows affected.


count
2551


### task 2
### Instruction
category values are in title case. Use LIKE to find category values with 'Trash' or 'Garbage' in them.

In [11]:
%%sql

-- Select categories containing Trash or Garbage
SELECT category
  FROM evanston311
 -- Use LIKE
 WHERE category LIKE '%Trash%'
    OR category LIKE '%Garbage%'
 LIMIT 10; --just an addition, so that the table is not elongated

 * postgresql://postgres:***@localhost:2828/datacamp
10 rows affected.


category
THIS REQUEST IS INACTIVE...Trash Cart - Compost Bin
Trash - Tire Pickup
Trash - Special Pickup - Resident Use
"Trash, Recycling, Yard Waste Cart- Repair/Replacement"
"Trash, Recycling, Yard Waste Cart- Repair/Replacement"
Trash - Missed Garbage Pickup
THIS REQUEST IS INACTIVE...Trash Cart - Compost Bin
Trash - Tire Pickup
Trash - Missed Garbage Pickup
Trash - Accumulation


### task 3
### Instruction
Count rows where the description includes 'trash' or 'garbage' but the category does not.

In [12]:
%%sql

-- Count rows
SELECT Count(*)
  FROM evanston311 
 -- description contains trash or garbage (any case)
 WHERE (description ILIKE '%Trash%'
    OR description ILIKE '%Garbage%') 
 -- category does not contain Trash or Garbage
   AND category NOT LIKE '%Trash%'
   AND category NOT LIKE '%Garbage%';

 * postgresql://postgres:***@localhost:2828/datacamp
1 rows affected.


count
570


### task 4
### Instruction
Find the most frequent categories for rows where the description includes 'trash' or 'garbage' but the category does not.

In [13]:
%%sql

-- Count rows with each category
SELECT category, count(*)
  FROM evanston311 
 WHERE (description ILIKE '%trash%'
    OR description ILIKE '%garbage%') 
   AND category NOT LIKE '%Trash%'
   AND category NOT LIKE '%Garbage%'
 -- What are you counting?
 GROUP BY category
 --- order by most frequent values
 ORDER BY count DESC
 LIMIT 10;

 * postgresql://postgres:***@localhost:2828/datacamp
10 rows affected.


category,count
Ask A Question / Send A Message,273
Rodents- Rats,77
Recycling - Missed Pickup,28
Dead Animal on Public Property,16
Graffiti,15
Yard Waste - Missed Pickup,14
Public Transit Agency Issue,13
Food Establishment - Unsanitary Conditions,13
Exterior Conditions,10
Street Sweeping,9


# 5. Concatenate strings
### Exercises
House number (house_num) and street are in two separate columns in evanston311. Concatenate them together with concat() with a space in between the values.

### Instructions
- Concatenate house_num, a space ' ', and street into a single value using the concat().
- Use a trim function to remove any spaces from the start of the concatenated value.

In [14]:
%%sql

-- Concatenate house_num, a space, and street
-- and trim spaces from the start of the result
SELECT trim(concat(house_num, ' ', street)) AS address
  FROM evanston311
    LIMIT 10; --just an addition, so that the table is not elongated

 * postgresql://postgres:***@localhost:2828/datacamp
10 rows affected.


address
606-612 Sheridan Road
930 Washington St
1183-1223 Lincoln St
1–111 Callan Ave
1524 Crain St
2830 Central Street
1139 Dodge Ave
900 Oakton Street
608 Oakton Street
1320 Dewey Avenue
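
The trim step matters because PostgreSQL's concat() skips NULL arguments instead of returning NULL, so a missing house_num leaves the separating space at the front of the result. A minimal Python sketch of that behavior follows, with `None` standing in for NULL. `pg_concat` is a hypothetical helper written for this illustration, and the row without a house number is made up.

```python
def pg_concat(*parts):
    """Mimic concat(): join the arguments as text, skipping None (NULL)."""
    return "".join(str(p) for p in parts if p is not None)

with_number = pg_concat("930", " ", "Washington St")
without_number = pg_concat(None, " ", "Washington St")

print(repr(with_number))              # house number present
print(repr(without_number))           # leading space left behind
print(repr(without_number.lstrip()))  # space removed from the start
```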


# 6. Split strings on a delimiter
### Exercises
The street suffix is the part of the street name that gives the type of street, such as Avenue, Road, or Street. In the Evanston 311 data, sometimes the street suffix is the full word, while other times it is the abbreviation.

Extract just the first word of each street value to find the most common streets regardless of the suffix.

To do this, use

split_part(string_to_split, delimiter, part_number)

### Instructions
- Use split_part() to select the first word in street; alias the result as street_name.
- Also select the count of each value of street_name.

In [15]:
%%sql

-- Select the first word of the street value
SELECT split_part(street, ' ', 1) AS street_name, 
       count(*)
  FROM evanston311
 GROUP BY street_name
 ORDER BY count DESC
 LIMIT 20;

 * postgresql://postgres:***@localhost:2828/datacamp
20 rows affected.


street_name,count
,1699
Chicago,1569
Central,1529
Sherman,1479
Davis,1248
Church,1225
Main,880
Sheridan,842
Ridge,823
Dodge,816
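
As a rough illustration of what split_part() returns, the sketch below mimics it in Python: fields are numbered from 1, and asking for a field that does not exist yields an empty string. The function is a hypothetical stand-in written for this example, not PostgreSQL code, and the sample list is made up from street names seen above.

```python
def split_part(string, delimiter, n):
    """Mimic split_part(): return the n-th field (1-based), or '' if absent."""
    fields = string.split(delimiter)
    return fields[n - 1] if 1 <= n <= len(fields) else ""

streets = ["Chicago Avenue", "Chicago Ave", "Sherman Avenue"]

# Taking the first word makes the full and abbreviated suffixes collapse together.
first_words = [split_part(s, " ", 1) for s in streets]
missing_field = split_part("Chicago Avenue", " ", 3)

print(first_words)
print(repr(missing_field))
```

A street name made of several words would be cut down to its first word by this approach, so the grouped counts are best read as an approximation.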


# 7. Shorten long strings
### Exercises
The description column of evanston311 can be very long. You can get the length of a string with the length() function.

For displaying or quickly reviewing the data, you might want to only display the first few characters. You can use the left() function to get a specified number of characters at the start of each value.

To indicate that more data is available, concatenate '...' to the end of any shortened description. To do this, you can use a CASE WHEN statement to add '...' only when the string length is greater than 50.

Select the first 50 characters of description when description starts with the word "I".

### Instructions
- Select the first 50 characters of description with '...' concatenated on the end where the length() of the description is greater than 50 characters. Otherwise just select the description as is.
- Select only descriptions that begin with the word 'I' and not the letter 'I'.
- For example, you would want to select "I like using SQL!", but would not want to select "In this course we use SQL!".

In [17]:
%%sql

-- Select the first 50 chars when length is greater than 50
SELECT CASE WHEN length(description) > 50
            THEN left(description, 50) || '...'
       -- otherwise just select description
       ELSE description
       END
  FROM evanston311
 -- limit to descriptions that start with the word I
 WHERE description LIKE 'I %'
 ORDER BY description
    LIMIT 10; --just an addition, so that the table is not elongated

 * postgresql://postgres:***@localhost:2828/datacamp
10 rows affected.


description
I work for Schermerhorn & Co. and manage this con...
"I Live in a townhouse with garbage cans in back, i..."
"I Put In For Reserve Disabled Parking, A Week Ago ..."
I SDO GOWANS #1258 RECEIVED A TELEPHONE CALL ON 3/...
I accidentally mistyped my license plate number - ...
I accidentally sent the wrong cover letter on my a...
I acquired c diff at north shore hospital in Evans...
I am a 35 year resident of Evanston (314 Custer Av...
I am a Cubs fan and watched game seven. But using ...
I am a Northwestern student that has accumulated t...
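
The same logic can be sketched in plain Python to check the two conditions separately: the length test that decides whether '...' is appended, and the `'I %'` pattern that requires the letter I followed by a space. The helper `shorten` and the third sample string are hypothetical; the first two samples are the examples from the instructions.

```python
def shorten(description, width=50):
    """Keep the first `width` characters and mark truncation with '...'."""
    if len(description) > width:
        return description[:width] + "..."
    return description

samples = [
    "I like using SQL!",           # starts with the word I, short enough to keep
    "In this course we use SQL!",  # starts with the letter I only, so excluded
    "I " + "x" * 60,               # starts with the word I, longer than 50
]

# Equivalent of WHERE description LIKE 'I %'
selected = [shorten(d) for d in samples if d.startswith("I ")]
print(selected)
```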


# 8. Create an "other" category
### Exercises
If we want to summarize Evanston 311 requests by zip code, it would be useful to group all of the low frequency zip codes together in an "other" category.

Which of the following values, when substituted for ??? in the query, would give the result below?

Query:

SELECT CASE WHEN zipcount < ??? THEN 'other'\
       ELSE zip\
       END AS zip_recoded,\
       sum(zipcount) AS zipsum\
  FROM (SELECT zip, count(*) AS zipcount\
          FROM evanston311\
         GROUP BY zip) AS fullcounts\
 GROUP BY zip_recoded\
 ORDER BY zipsum DESC;

Result:

zip_recoded    zipsum\
60201          19054\
60202          11165\
null           5528\
other          429\
60208          255

### Possible Answers:
- A. 255
- B. 1000
- C. 100
- D. 60201

Answer: C
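
To see how the recoding query behaves, it can be run on a small made-up table. The sketch below uses Python's sqlite3 module as a stand-in for PostgreSQL; the table `requests`, the zip values A to D, and the threshold of 2 are all hypothetical and chosen only to keep the example small.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption for illustration).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE requests (zip TEXT)")

# Hypothetical data: two frequent zip values and two rare ones.
rows = [("A",)] * 5 + [("B",)] * 3 + [("C",)] + [("D",)]
con.executemany("INSERT INTO requests VALUES (?)", rows)

threshold = 2  # groups with fewer rows than this are merged into 'other'
result = con.execute(
    """
    SELECT CASE WHEN zipcount < ? THEN 'other'
                ELSE zip
           END AS zip_recoded,
           sum(zipcount) AS zipsum
      FROM (SELECT zip, count(*) AS zipcount
              FROM requests
             GROUP BY zip) AS fullcounts
     GROUP BY zip_recoded
     ORDER BY zipsum DESC
    """,
    (threshold,),
).fetchall()
print(result)
```

The inner query counts rows per zip, and the outer query sums those counts after relabeling, so the two single-row zips end up in one 'other' row.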

# 9. Group and recode values
### Exercises
There are almost 150 distinct values of evanston311.category. But some of these categories are similar, with the form "Main Category - Details". We can get a better sense of what requests are common if we aggregate by the main category.

To do this, create a temporary table recode mapping distinct category values to new, standardized values. Make the standardized values the part of the category before a dash ('-'). Extract this value with the split_part() function:

split_part(string text, delimiter text, field int)

You'll also need to do some additional cleanup of a few cases that don't fit this pattern.

Then the evanston311 table can be joined to recode to group requests by the new standardized category values.

### task 1
### Instruction
Create recode with a standardized column; use split_part() and then rtrim() to remove any remaining whitespace on the result of split_part().

In [18]:
%%sql

-- Fill in the command below with the name of the temp table
DROP TABLE IF EXISTS recode;

-- Create and name the temporary table
CREATE TEMP TABLE recode AS
-- Write the select query to generate the table 
-- with distinct values of category and standardized values
  SELECT DISTINCT category, 
         rtrim(split_part(category, '-', 1)) AS standardized
    -- What table are you selecting the above values from?
    FROM evanston311;
    
-- Look at a few values before the next step
SELECT DISTINCT standardized 
  FROM recode
 WHERE standardized LIKE 'Trash%Cart'
    OR standardized LIKE 'Snow%Removal%';

 * postgresql://postgres:***@localhost:2828/datacamp
Done.
149 rows affected.
6 rows affected.


standardized
Snow Removal
Snow Removal/Concerns
Snow/Ice/Hazard Removal
Trash Cart
"Trash Cart, Recycling Cart"
"Trash, Recycling, Yard Waste Cart"


### task 2
### Instruction
- UPDATE standardized values LIKE 'Trash%Cart' to 'Trash Cart'.
- UPDATE standardized values of 'Snow Removal/Concerns' and 'Snow/Ice/Hazard Removal' to 'Snow Removal'.

In [19]:
%%sql

-- Code from previous step
DROP TABLE IF EXISTS recode;

CREATE TEMP TABLE recode AS
  SELECT DISTINCT category, 
         rtrim(split_part(category, '-', 1)) AS standardized
    FROM evanston311;

-- Update to group trash cart values
UPDATE recode 
   SET standardized='Trash Cart' 
 WHERE standardized LIKE 'Trash%Cart';

-- Update to group snow removal values
UPDATE recode 
   SET standardized='Snow Removal' 
 WHERE standardized LIKE 'Snow%Removal%';
    
-- Examine effect of updates
SELECT DISTINCT standardized 
  FROM recode
 WHERE standardized LIKE 'Trash%Cart'
    OR standardized LIKE 'Snow%Removal%';

 * postgresql://postgres:***@localhost:2828/datacamp
Done.
149 rows affected.
3 rows affected.
5 rows affected.
2 rows affected.


standardized
Snow Removal
Trash Cart


### task 3
### Instruction
UPDATE recode by setting standardized values of 'THIS REQUEST IS INACTIVE...Trash Cart', '(DO NOT USE) Water Bill', 'DO NOT USE Trash', and 'NO LONGER IN USE' to 'UNUSED'.

In [20]:
%%sql

-- Code from previous step
DROP TABLE IF EXISTS recode;

CREATE TEMP TABLE recode AS
  SELECT DISTINCT category, 
         rtrim(split_part(category, '-', 1)) AS standardized
    FROM evanston311;
  
UPDATE recode SET standardized='Trash Cart' 
 WHERE standardized LIKE 'Trash%Cart';

UPDATE recode SET standardized='Snow Removal' 
 WHERE standardized LIKE 'Snow%Removal%';

-- Update to group unused/inactive values
UPDATE recode 
   SET standardized='UNUSED' 
 WHERE standardized IN ('THIS REQUEST IS INACTIVE...Trash Cart', 
               '(DO NOT USE) Water Bill',
               'DO NOT USE Trash', 
               'NO LONGER IN USE');

-- Examine effect of updates
SELECT DISTINCT standardized 
  FROM recode
 ORDER BY standardized
LIMIT 10; --just an addition, so that the table is not elongated

 * postgresql://postgres:***@localhost:2828/datacamp
Done.
149 rows affected.
3 rows affected.
5 rows affected.
4 rows affected.
10 rows affected.


standardized
ADA/Inclusion Aids
Abandoned Bicycle on City Property
Abandoned Vehicle
Accessibility
Advanced Disposal
Alarm Registration
Alleys
Amplified Sounds and/or Music
Animal Issue/Concern
Animal Service


### task 4
### Instruction
- Now, join the evanston311 and recode tables to count the number of requests with each of the standardized values
- List the most common standardized values first.

In [21]:
%%sql

-- Code from previous step
DROP TABLE IF EXISTS recode;
CREATE TEMP TABLE recode AS
  SELECT DISTINCT category, 
         rtrim(split_part(category, '-', 1)) AS standardized
  FROM evanston311;
UPDATE recode SET standardized='Trash Cart' 
 WHERE standardized LIKE 'Trash%Cart';
UPDATE recode SET standardized='Snow Removal' 
 WHERE standardized LIKE 'Snow%Removal%';
UPDATE recode SET standardized='UNUSED' 
 WHERE standardized IN ('THIS REQUEST IS INACTIVE...Trash Cart', 
               '(DO NOT USE) Water Bill',
               'DO NOT USE Trash', 'NO LONGER IN USE');

-- Select the recoded categories and the count of each
SELECT standardized, count(*)
-- From the original table and table with recoded values
  FROM evanston311 
       LEFT JOIN recode 
       -- What column do they have in common?
       ON evanston311.category = recode.category 
 -- What do you need to group by to count?
 GROUP BY standardized
 -- Display the most common values first
 ORDER BY count DESC
    LIMIT 10; --just an addition, so that the table is not elongated

 * postgresql://postgres:***@localhost:2828/datacamp
Done.
149 rows affected.
3 rows affected.
5 rows affected.
4 rows affected.
10 rows affected.


standardized,count
Broken Parking Meter,6092
Trash,3699
Ask A Question / Send A Message,2595
Trash Cart,1902
Tree Evaluation,1879
Rodents,1305
Recycling,1224
Dead Animal on Public Property,1057
Child Seat Installation or Inspection,1028
Fire Prevention,880
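
The recode-then-join idea can also be sketched in plain Python, with a dictionary playing the role of the recode table. The category strings below appear in earlier outputs, but the list of requests itself is made up, and only two of the cleanup rules are reproduced, so this is an illustration of the approach rather than a replacement for the SQL.

```python
from collections import Counter

# Hypothetical request list built from category values seen in earlier outputs.
requests = [
    "Trash - Tire Pickup",
    "Trash - Missed Garbage Pickup",
    "Trash, Recycling, Yard Waste Cart- Repair/Replacement",
    "THIS REQUEST IS INACTIVE...Trash Cart - Compost Bin",
    "Rodents- Rats",
]

def standardize(category):
    # Roughly rtrim(split_part(category, '-', 1))
    value = category.split("-")[0].rstrip()
    # Roughly LIKE 'Trash%Cart'
    if value.startswith("Trash") and value.endswith("Cart"):
        value = "Trash Cart"
    # One of the values that the notebook recodes to 'UNUSED'
    if value == "THIS REQUEST IS INACTIVE...Trash Cart":
        value = "UNUSED"
    return value

# The mapping plays the role of the recode temp table: one entry per distinct category.
recode = {c: standardize(c) for c in set(requests)}

# Looking up each request in the mapping corresponds to the join; Counter does the GROUP BY.
counts = Counter(recode[c] for c in requests)
print(counts.most_common())
```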


# 10. Create a table with indicator variables
### Exercises
Determine whether medium and high priority requests in the evanston311 data are more likely to contain requesters' contact information: an email address or phone number.

- Emails contain an @.
- Phone numbers have the pattern of three characters, dash, three characters, dash, four characters. For example: 555-555-1212.

Use LIKE to match these patterns. Remember % matches any number of characters (even 0), and _ matches a single character. Enclosing a pattern in % (i.e. before and after your pattern) allows you to locate it within other text.

For example, '%___.com%' would allow you to search for a reference to a website with the top-level domain '.com' and at least three characters preceding it.

Create and store indicator variables for email and phone in a temporary table. LIKE produces True or False as a result, but casting a boolean (True or False) as an integer converts True to 1 and False to 0. This makes the values easier to summarize later.
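
As a small, hedged illustration of the indicator idea, the sketch below runs the two LIKE patterns on four made-up descriptions, using Python's sqlite3 module as a stand-in for PostgreSQL. The table `requests` and every description in it are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption for illustration).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE requests (id INTEGER, description TEXT)")
con.executemany(
    "INSERT INTO requests VALUES (?, ?)",
    [
        (1, "Please call 555-555-1212"),
        (2, "Reach me at someone@example.com"),
        (3, "Pothole near the park"),
        (4, "Part abc-def-ghij is missing"),
    ],
)

# Casting the result of LIKE to integer gives 1 for a match and 0 otherwise.
indicators = con.execute(
    """
    SELECT id,
           CAST(description LIKE '%@%' AS integer) AS email,
           CAST(description LIKE '%___-___-____%' AS integer) AS phone
      FROM requests
     ORDER BY id
    """
).fetchall()
print(indicators)
```

Row 4 is flagged as a phone number even though it contains no digits, because `_` matches any single character. The phone indicator is therefore an approximation.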

### task 1
### Instructions
- Create a temp table indicators from evanston311 with three columns: id, email, and phone.
- Use LIKE comparisons to detect the email and phone patterns that are in the description, and cast the result as an integer with CAST().
- Your phone indicator should use a combination of underscores _ and dashes - to represent a standard 10-digit phone number format.
- Remember to start and end your patterns with % so that you can locate the pattern within other text!

In [22]:
%%sql

-- To clear table if it already exists
DROP TABLE IF EXISTS indicators;

-- Create the indicators temp table
CREATE TEMP TABLE indicators AS
  -- Select id
  SELECT id, 
         -- Create the email indicator (find @)
         CAST (description LIKE '%@%' AS integer) AS email,
         -- Create the phone indicator
         CAST (description LIKE '%___-___-____%' AS integer) AS phone 
    -- What table contains the data? 
    FROM evanston311;

-- Inspect the contents of the new temp table
SELECT *
  FROM indicators
    LIMIT 10; --just an addition, so that the table is not elongated

 * postgresql://postgres:***@localhost:2828/datacamp
Done.
36431 rows affected.
10 rows affected.


id,email,phone
1340563,0,0
1826017,0,0
1849204,0,0
1880254,0,0
1972582,0,1
1840025,0,0
2099219,0,0
2554820,0,0
1770749,0,0
2129641,0,1


### task 2
### Instruction
- Join the indicators table to evanston311, selecting the proportion of reports including an email or phone grouped by priority.
- Include adjustments to account for issues arising from integer division.

In [23]:
%%sql

-- To clear table if it already exists
DROP TABLE IF EXISTS indicators;

-- Create the temp table
CREATE TEMP TABLE indicators AS
  SELECT id, 
         CAST (description LIKE '%@%' AS integer) AS email,
         CAST (description LIKE '%___-___-____%' AS integer) AS phone 
    FROM evanston311;
  
-- Select the column you'll group by
SELECT priority,
       -- Compute the proportion of rows with each indicator
       sum(email)/count(*)::numeric AS email_prop, 
       sum(phone)/count(*)::numeric AS phone_prop
  -- Tables to select from
  FROM evanston311
       LEFT JOIN indicators
       -- Joining condition
       ON evanston311.id=indicators.id
 -- What are you grouping by?
 GROUP BY priority;

 * postgresql://postgres:***@localhost:2828/datacamp
Done.
36431 rows affected.
4 rows affected.


priority,email_prop,phone_prop
MEDIUM,0.0196692776327241,0.0184508268059181
NONE,0.004122203384196,0.005684651441109
HIGH,0.0113636363636363,0.0227272727272727
LOW,0.0058027079303675,0.0019342359767891
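
The cast in `sum(email)/count(*)::numeric` applies to `count(*)` and is there because dividing one integer by another discards the fractional part. The sketch below shows the effect on a tiny made-up indicator column, again with sqlite3 standing in for PostgreSQL. SQLite has no `::numeric` syntax, so `CAST(... AS REAL)` plays that role here as an assumption of the example.

```python
import sqlite3

# In-memory SQLite database as a stand-in for PostgreSQL (assumption for illustration).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE indicators (email INTEGER)")

# Hypothetical indicator column: one email among four requests.
con.executemany("INSERT INTO indicators VALUES (?)", [(1,), (0,), (0,), (0,)])

int_prop, real_prop = con.execute(
    """
    SELECT sum(email) / count(*),                -- integer / integer
           sum(email) / CAST(count(*) AS REAL)   -- integer / real
      FROM indicators
    """
).fetchone()
print(int_prop, real_prop)
```

Without the cast the proportion collapses to 0, which is the issue the instruction asks you to account for.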
