In [1]:
import pandas as pd
import sqlalchemy as sa
import psycopg2 as ps
from sqlalchemy import create_engine

In [2]:
%load_ext sql
%sql postgresql://postgres:lingga28@localhost:2828/datacamp
conn = create_engine('postgresql://postgres:lingga28@localhost:2828/datacamp')

# 1. Division
### Exercises
Compute the average revenue per employee for Fortune 500 companies by sector.

### Instructions
- Compute revenue per employee by dividing revenues by employees; use casting to produce a numeric result.
- Take the average of revenue per employee with avg(); alias this as avg_rev_employee.
- Group by sector.
- Order by the average revenue per employee.
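
A minimal Python sketch of why the cast matters, assuming revenues and employees are stored as integers (the instruction to cast suggests this, but the schema is not shown here). PostgreSQL's integer division discards the fractional part, much like Python's `//` on positive values; `Decimal` stands in for `::numeric`. The function name and the sample rows are hypothetical.

```python
from collections import defaultdict
from decimal import Decimal


def avg_rev_per_employee(rows):
    """rows: iterable of (sector, revenues, employees) tuples."""
    ratios = defaultdict(list)
    for sector, revenues, employees in rows:
        # Decimal plays the role of the ::numeric cast: the fraction is kept.
        ratios[sector].append(Decimal(revenues) / Decimal(employees))
    averages = {sector: sum(vals) / len(vals) for sector, vals in ratios.items()}
    # Ascending sort by the average, like ORDER BY avg_rev_employee.
    return sorted(averages.items(), key=lambda item: item[1])


# Hypothetical rows, not taken from the fortune500 table.
rows = [("A", 100, 400), ("A", 300, 400), ("B", 50, 500)]
result = avg_rev_per_employee(rows)

# Without the cast, integer division would lose the fraction entirely.
integer_ratio = 100 // 400
```

With integer division every ratio below 1 collapses to 0, so the per-sector averages would be meaningless.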

In [3]:
%%sql

-- Select average revenue per employee by sector
SELECT sector, 
       avg(revenues/employees::numeric) AS avg_rev_employee
  FROM fortune500
 GROUP BY sector
 -- Use the column alias to order the results
 ORDER BY avg_rev_employee;

 * postgresql://postgres:***@localhost:2828/datacamp
21 rows affected.


sector,avg_rev_employee
"Hotels, Restaurants & Leisure",0.0949871815105681
Apparel,0.2786594297668006
Food & Drug Stores,0.307999504100602
Motor Vehicles & Parts,0.3425271242465952
Household Products,0.3555733896959535
Retailing,0.3601945609207808
Industrials,0.3614854337614634
Aerospace & Defense,0.3667149924862827
Transportation,0.4036535247732958
Business Services,0.4201099421016663


# 2. Explore with division
### Exercises
In exploring a new database, it can be unclear what the data means and how columns are related to each other.

What information does the unanswered_pct column in the stackoverflow table contain? Is it the percent of questions with the tag that are unanswered (unanswered questions with tag / all questions with tag)? Or is it something else, such as the percent of all unanswered questions on the site that have the tag (unanswered questions with tag / all unanswered questions)?

Divide unanswered_count (unanswered questions with tag) by question_count (all questions with tag) and check whether the result matches unanswered_pct.

### Instructions
- Exclude rows where question_count is 0 to avoid a divide by zero error.
- Limit the result to 10 rows.

In [4]:
%%sql

-- Divide unanswered_count by question_count
SELECT unanswered_count/question_count::numeric AS computed_pct, 
       -- What are you comparing the above quantity to?
       unanswered_pct
  FROM stackoverflow
 -- Select rows where question_count is not 0
 WHERE question_count != 0
 LIMIT 10;

 * postgresql://postgres:***@localhost:2828/datacamp
10 rows affected.


computed_pct,unanswered_pct
0.4654847645429362,0.001751857
0.3863636363636363,0.000116972
0.3937677053824362,5.8e-05
0.3318965517241379,1.61e-05
0.4292857142857142,0.000125312
0.3479896172925006,0.012886449
0.3508386217225587,0.007619406
0.3072916666666666,1.23e-05
0.3542805100182149,8.11e-05
0.3806577661999348,0.000243743


# 3. Summarize numeric columns
### Exercises
Summarize the profit column in the fortune500 table using the functions you've learned.


### task 1
### Instruction
- Compute the min(), avg(), max(), and stddev() of profits.
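
For reference, a small Python sketch of the same four summaries. PostgreSQL's `stddev()` is an alias for the sample standard deviation (`stddev_samp()`), which corresponds to `statistics.stdev` rather than `statistics.pstdev`. The function name and the sample values are hypothetical.

```python
import statistics


def summarize(values):
    """Return min, avg, max and sample standard deviation of a list."""
    return {
        "min": min(values),
        "avg": statistics.mean(values),
        "max": max(values),
        # Sample standard deviation (divides by n - 1), like stddev().
        "stddev": statistics.stdev(values),
    }


# Hypothetical values, not taken from the profits column.
summary = summarize([2, 4, 4, 4, 5, 5, 7, 9])
```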

In [5]:
%%sql

-- Select min, avg, max, and stddev of fortune500 profits
SELECT stddev(profits),
       min(profits),
       avg(profits),
       max(profits)
  FROM fortune500;

 * postgresql://postgres:***@localhost:2828/datacamp
1 rows affected.


stddev,min,avg,max
3940.495363490788,-6177,1783.4753507014027,45687


### task 2
### Instructions
- Now repeat step 1, but summarize profits by sector.
- Order the results by the average profits for each sector.

In [6]:
%%sql

-- Select sector and summary measures of fortune500 profits
SELECT sector,
       min(profits),
       max(profits),
       avg(profits),
       stddev(profits)
  FROM fortune500
 -- What to group by?
 GROUP BY sector
 -- Order by the average profits
 ORDER BY avg;

 * postgresql://postgres:***@localhost:2828/datacamp
21 rows affected.


sector,min,max,avg,stddev
Energy,-6177.0,7840.0,10.444642857142856,2264.572142925951
Materials,-440.0,1027.0,272.4684210526316,406.632781447055
Engineering & Construction,15.0,911.8,390.1692307692308,277.66512019762
Wholesalers,-199.4,2258.0,391.2793103448276,532.171183776766
Retailing,-2221.0,13643.0,991.7851063829788,2348.342559077222
Chemicals,-3.9,4318.0,1137.0214285714285,1129.752304492226
Business Services,57.2,5991.0,1155.355,1454.360686992199
Food & Drug Stores,-502.2,4173.0,1217.4285714285713,1613.041448851915
Apparel,396.0,3760.0,1263.7,1419.134570786013
"Hotels, Restaurants & Leisure",348.0,4686.5,1451.06,1372.975732730432


# 4. Summarize group statistics
### Exercises
Sometimes you want to understand how a value varies across groups. For example, how does the maximum value per group vary across groups?

To find out, first summarize by group, and then compute summary statistics of the group results. One way to do this is to compute group values in a subquery, and then summarize the results of the subquery.

For this exercise, what is the standard deviation across tags in the maximum number of Stack Overflow questions per day? What about the mean, min, and max of the maximums as well?

### Instructions
- Start by writing a subquery to compute the max() of question_count per tag; alias the subquery result as maxval.
- Then compute the standard deviation of maxval with stddev().
- Compute the min(), max(), and avg() of maxval too.
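
The two-stage pattern (group first, then summarize the group results) can be sketched in Python as follows. The inner dictionary plays the role of the subquery and the final summary plays the role of the outer SELECT. The function name and the sample rows are assumptions for illustration.

```python
import statistics


def summarize_group_max(rows):
    """rows: iterable of (tag, question_count) pairs."""
    # Stage 1: maximum question_count per tag (the subquery).
    max_by_tag = {}
    for tag, count in rows:
        max_by_tag[tag] = max(count, max_by_tag.get(tag, count))
    maxvals = list(max_by_tag.values())
    # Stage 2: summary statistics over the per-tag maxima (the outer query).
    return {
        "stddev": statistics.stdev(maxvals),
        "min": min(maxvals),
        "max": max(maxvals),
        "avg": statistics.mean(maxvals),
    }


# Hypothetical rows: per-tag maxima are a=5, b=3, c=10.
stats = summarize_group_max([("a", 1), ("a", 5), ("b", 3), ("c", 10), ("c", 7)])
```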

In [7]:
%%sql

-- Compute standard deviation of maximum values
SELECT stddev(maxval),
        -- min
       min(maxval),
       -- max
       max(maxval),
       -- avg
       avg(maxval)
  -- Subquery to compute max of question_count by tag
  FROM (SELECT max(question_count) AS maxval
          FROM stackoverflow
         -- Compute max by...
         GROUP BY tag) AS max_results; -- alias for subquery

 * postgresql://postgres:***@localhost:2828/datacamp
1 rows affected.


stddev,min,max,avg
176458.3795272,30,1138658,52652.43396226415


# 5. Truncate
### Exercises
Use trunc() to examine the distributions of attributes of the Fortune 500 companies.

Remember that trunc() truncates numbers by replacing lower place value digits with zeros:

`trunc(value_to_truncate, places_to_truncate)`

Negative values for places_to_truncate indicate digits to the left of the decimal point to replace with zeros, while positive values indicate digits to the right of the decimal point to keep.

### task 1
### Instructions
- Use trunc() to truncate employees to the 100,000s (5 zeros).
- Count the number of observations with each truncated value.

In [8]:
%%sql

-- Truncate employees
SELECT TRUNC(employees, -5) AS employee_bin,
       -- Count number of companies with each truncated value
       COUNT(*)
  FROM fortune500
 -- Use alias to group
 GROUP BY employee_bin
 -- Use alias to order
 ORDER BY employee_bin;

 * postgresql://postgres:***@localhost:2828/datacamp
6 rows affected.


employee_bin,count
0,433
100000,35
200000,20
300000,7
400000,4
2300000,1


### task 2
### Instruction
- Repeat step 1 for companies with fewer than 100,000 employees (the most common bin).
- This time, truncate employees to the 10,000s place.

In [9]:
%%sql

-- Truncate employees
SELECT TRUNC(employees, -4) AS employee_bin,
       -- Count number of companies with each truncated value
       count(*)
  FROM fortune500
 -- Limit to which companies?
 WHERE employees < 100000
 -- Use alias to group
 GROUP BY employee_bin
 -- Use alias to order
 ORDER BY employee_bin;

 * postgresql://postgres:***@localhost:2828/datacamp
10 rows affected.


employee_bin,count
0,102
10000,108
20000,63
30000,42
40000,35
50000,31
60000,18
70000,18
80000,6
90000,10


# 6. Generate series
### Exercises
Summarize the distribution of the number of questions with the tag "dropbox" on Stack Overflow per day by binning the data.

Recall the signature `generate_series(from, to, step)`.

### task 1
### Instruction
Start by selecting the minimum and maximum of the question_count column for the tag 'dropbox' so you know the range of values to cover with the bins.

In [10]:
%%sql

-- Select the min and max of question_count
SELECT min(question_count), 
       max(question_count)
  -- From what table?
  FROM stackoverflow
 -- For tag dropbox
 WHERE tag = 'dropbox';

 * postgresql://postgres:***@localhost:2828/datacamp
1 rows affected.


min,max
2315,3072


### task 2
### Instruction
- Next, use generate_series() to create bins of size 50 from 2200 to 3100.
- To do this, you need an upper and lower bound to define a bin.
- This will require you to modify the stopping value of the lower bound and the starting value of the upper bound by the bin width.
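
The bound arithmetic can be sketched in Python. `generate_series()` includes its stop value, so the helper below adds 1 to the stop passed to `range`. The lower bounds stop one bin width before 3100 and the upper bounds start one bin width after 2200. The helper is a stand-in for illustration and only handles positive integer steps.

```python
def generate_series(start, stop, step):
    """Inclusive integer series for a positive step, mimicking generate_series()."""
    return list(range(start, stop + 1, step))


bin_width = 50
lower = generate_series(2200, 3100 - bin_width, bin_width)  # 2200 .. 3050
upper = generate_series(2200 + bin_width, 3100, bin_width)  # 2250 .. 3100
bins = list(zip(lower, upper))
```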

In [11]:
%%sql

-- Create lower and upper bounds of bins
SELECT generate_series(2200, 3050, 50) AS lower,
       generate_series(2250, 3100, 50) AS upper;

 * postgresql://postgres:***@localhost:2828/datacamp
18 rows affected.


lower,upper
2200,2250
2250,2300
2300,2350
2350,2400
2400,2450
2450,2500
2500,2550
2550,2600
2600,2650
2650,2700


### task 3
### Instruction
- Select lower and upper from bins, along with the count of values within each bin bounds.
- To do this, you'll need to join 'dropbox', which contains the question_count for tag "dropbox", to the bins created by generate_series().
- The join should occur where the count is greater than or equal to the lower bound, and strictly less than the upper bound.
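
A minimal Python sketch of the join condition, assuming bins are (lower, upper) pairs. Each bin is kept even when nothing falls into it, which mirrors the LEFT JOIN, and the half-open test `lower <= value < upper` ensures a value on a boundary is counted once. The function name and the sample data are hypothetical.

```python
def count_per_bin(bins, values):
    """Return (lower, upper, count) for every bin, including empty ones."""
    result = []
    for lower, upper in bins:
        # Half-open interval: includes lower, excludes upper.
        n = sum(1 for v in values if lower <= v < upper)
        result.append((lower, upper, n))
    return result


# Hypothetical bins and values; 50 sits on a boundary and goes to the second bin.
counts = count_per_bin([(0, 50), (50, 100), (100, 150)], [10, 49, 50])
```

In the SQL version, counting `question_count` rather than `*` is what makes empty bins report 0: the LEFT JOIN produces one row of NULLs for an unmatched bin, and `count(column)` ignores NULLs while `count(*)` would count that row.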

In [12]:
%%sql

-- Bins created in Step 2
WITH bins AS (
      SELECT generate_series(2200, 3050, 50) AS lower,
             generate_series(2250, 3100, 50) AS upper),
     -- Subset stackoverflow to just tag dropbox (Step 1)
     dropbox AS (
      SELECT question_count 
        FROM stackoverflow
       WHERE tag='dropbox') 
-- Select columns for result
-- What column are you counting to summarize?
SELECT lower, upper, count(question_count) 
  FROM bins  -- Created above
       -- Join to dropbox (created above), 
       -- keeping all rows from the bins table in the join
       LEFT JOIN dropbox
       -- Compare question_count to lower and upper
         ON question_count >= lower 
        AND question_count < upper
 -- Group by lower and upper to count values in each bin
 GROUP BY lower, upper
 -- Order by lower to put bins in order
 ORDER BY lower;

 * postgresql://postgres:***@localhost:2828/datacamp
18 rows affected.


lower,upper,count
2200,2250,0
2250,2300,0
2300,2350,22
2350,2400,39
2400,2450,54
2450,2500,53
2500,2550,45
2550,2600,41
2600,2650,46
2650,2700,57


# 7. Correlation
### Exercises
What's the relationship between a company's revenue and its other financial attributes? Compute the correlation between revenues and other financial variables with the corr() function.

### Instructions
- Compute the correlation between revenues and profits.
- Compute the correlation between revenues and assets.
- Compute the correlation between revenues and equity.
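
For intuition, here is a small Python sketch of the Pearson correlation coefficient that `corr()` computes. Pairs where either value is missing are skipped, which matches how the PostgreSQL aggregate treats NULLs. The function and the sample data are illustrative only.

```python
import math


def corr(xs, ys):
    """Pearson correlation of two equal-length sequences, skipping None pairs."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: a perfectly linear relationship and a reversed one.
r_up = corr([1, 2, 3, 4], [2, 4, 6, 8])
r_down = corr([1, 2, 3], [3, 2, 1])
```

Values near 1 or -1 indicate a strong linear relationship; values near 0 indicate little linear relationship.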

In [13]:
%%sql

-- Correlation between revenues and profit
SELECT corr(revenues, profits) AS rev_profits,
       -- Correlation between revenues and assets
       corr(revenues, assets) AS rev_assets,
       -- Correlation between revenues and equity
       corr(revenues, equity) AS rev_equity 
  FROM fortune500;

 * postgresql://postgres:***@localhost:2828/datacamp
1 rows affected.


rev_profits,rev_assets,rev_equity
0.5999935815724783,0.3294995213185059,0.5465709997184311


# 8. Mean and Median
### Exercises
Compute the mean (avg()) and median assets of Fortune 500 companies by sector.

Use the percentile_disc() function to compute the median:

`percentile_disc(0.5) WITHIN GROUP (ORDER BY column_name)`

### Instructions
- Select the mean and median of assets.
- Group by sector.
- Order the results by the mean.
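
A hedged Python sketch of the discrete percentile. `percentile_disc()` returns the first value in the ordering whose position, as a fraction of the row count, equals or exceeds the requested fraction, so the result is always an actual value from the data. The function name mirrors the SQL one for readability; the sample data is hypothetical.

```python
import statistics


def percentile_disc(values, fraction):
    """First sorted value whose cumulative position reaches the fraction."""
    ordered = sorted(values)
    n = len(ordered)
    for position, value in enumerate(ordered, start=1):
        if position / n >= fraction:
            return value
    raise ValueError("fraction must be between 0 and 1 and values non-empty")


# Hypothetical skewed data: one large value pulls the mean far above the median.
assets = [10, 20, 30, 40, 1000]
median = percentile_disc(assets, 0.5)
mean = statistics.mean(assets)
```

A mean well above the median, as in several sectors of the output, points to a few very large companies skewing the distribution.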

In [14]:
%%sql

-- What groups are you computing statistics by?
SELECT sector,
       -- Select the mean of assets with the avg function
       avg(assets) AS mean,
       -- Select the median
       percentile_disc(0.5) WITHIN GROUP (ORDER BY assets) AS median
  FROM fortune500
 -- Computing statistics for each what?
 GROUP BY sector
 -- Order results by a value of interest
 ORDER BY mean;

 * postgresql://postgres:***@localhost:2828/datacamp
21 rows affected.


sector,mean,median
Engineering & Construction,8199.23076923077,8709
Wholesalers,9362.586206896553,5390
Materials,10833.263157894737,7741
Apparel,11064.8,9739
Retailing,14473.148936170212,7858
"Hotels, Restaurants & Leisure",16795.4,14330
Business Services,19626.1,12485
Chemicals,20151.214285714286,15769
Household Products,23179.083333333332,10231
Food & Drug Stores,24630.714285714286,17464


# 9. Create a temp table
### Exercises
Find the Fortune 500 companies that have profits in the top 20% for their sector (compared to other Fortune 500 companies).

To do this, first, find the 80th percentile of profit for each sector with

`percentile_disc(fraction) WITHIN GROUP (ORDER BY sort_expression)`

and save the results in a temporary table.

Then join `fortune500` to the temporary table to select companies with profits greater than the 80th percentile cut-off.

### task 1
### Instructions
- Create a temporary table called profit80 containing the sector and 80th percentile of profits for each sector.
- Alias the percentile column as pct80.

In [15]:
%%sql

-- Drop the temp table if it already exists
DROP TABLE IF EXISTS profit80;

-- Create the temporary table
CREATE TEMP TABLE profit80 AS 
  -- Select the two columns you need; alias as needed
  SELECT sector, 
         percentile_disc(0.8) WITHIN GROUP (ORDER BY profits) AS pct80
    -- What table are you getting the data from?
    FROM fortune500
   -- What do you need to group by?
   GROUP BY sector;
   
-- See what you created: select all columns and rows 
-- from the table you created
SELECT * 
  FROM profit80;

 * postgresql://postgres:***@localhost:2828/datacamp
Done.
21 rows affected.
21 rows affected.


sector,pct80
Aerospace & Defense,4895.0
Apparel,1074.1
Business Services,1401.0
Chemicals,1500.0
Energy,1311.0
Engineering & Construction,602.7
Financials,3014.0
Food & Drug Stores,2025.7
"Food, Beverages & Tobacco",6073.0
Health Care,4965.0


### task 2
### Instruction
- Using the profit80 table you created in step 1, select companies that have profits greater than pct80.
- Select the title, sector, profits from fortune500, as well as the ratio of the company's profits to the 80th percentile profit.
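
The two steps (compute a per-sector cut-off, then join back and filter) can be sketched in Python. The dictionary `pct80` stands in for the temp table and the lookup by sector stands in for the join. All names and sample rows here are assumptions for illustration.

```python
from collections import defaultdict


def percentile_disc(values, fraction):
    """First sorted value whose cumulative position reaches the fraction."""
    ordered = sorted(values)
    n = len(ordered)
    for position, value in enumerate(ordered, start=1):
        if position / n >= fraction:
            return value
    raise ValueError("fraction must be between 0 and 1 and values non-empty")


def top_companies(rows, fraction=0.8):
    """rows: list of (title, sector, profits). Returns rows above the sector cut-off."""
    by_sector = defaultdict(list)
    for _, sector, profits in rows:
        by_sector[sector].append(profits)
    # Step 1: the "temp table" of cut-offs, one per sector.
    pct80 = {sector: percentile_disc(vals, fraction) for sector, vals in by_sector.items()}
    # Step 2: join each company to its sector's cut-off and keep those above it.
    return [
        (title, sector, profits, profits / pct80[sector])
        for title, sector, profits in rows
        if profits > pct80[sector]
    ]


# Hypothetical companies in two sectors.
rows = [
    ("A", "X", 10), ("B", "X", 20), ("C", "X", 30), ("D", "X", 40), ("E", "X", 50),
    ("F", "Y", 100), ("G", "Y", 200),
]
top = top_companies(rows)
```

Because the comparison is strictly greater than the cut-off, the company sitting exactly at the 80th percentile is excluded, as in the SQL `WHERE profits > pct80`.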

In [16]:
%%sql

-- Code from previous step
DROP TABLE IF EXISTS profit80;

CREATE TEMP TABLE profit80 AS
  SELECT sector, 
         percentile_disc(0.8) WITHIN GROUP (ORDER BY profits) AS pct80
    FROM fortune500 
   GROUP BY sector;

-- Select columns, aliasing as needed
SELECT title, fortune500.sector, 
       profits, profits/pct80 AS ratio
-- What tables do you need to join?  
  FROM fortune500 
       LEFT JOIN profit80
-- How are the tables joined?
       ON fortune500.sector=profit80.sector
-- What rows do you want to select?
 WHERE profits > pct80
 LIMIT 10; -- added to keep the displayed output short

 * postgresql://postgres:***@localhost:2828/datacamp
Done.
21 rows affected.
10 rows affected.


title,sector,profits,ratio
Lockheed Martin,Aerospace & Defense,5302.0,1.0831460674157305
United Technologies,Aerospace & Defense,5055.0,1.0326864147088866
Nike,Apparel,3760.0,3.500605157806536
S&P Global,Business Services,2106.0,1.5032119914346895
Mastercard,Business Services,4059.0,2.897216274089936
ADP,Business Services,1492.5,1.0653104925053531
Visa,Business Services,5991.0,4.2762312633832975
DuPont,Chemicals,2513.0,1.6753333333333331
Dow Chemical,Chemicals,4318.0,2.8786666666666667
PPL,Energy,1902.0,1.4508009153318078


# 10. Create a temp table to simplify a query
### Exercises
The Stack Overflow data contains daily question counts through 2018-09-25 for all tags, but each tag has a different starting date in the data.

Find out how many questions had each tag on the first date for which data for the tag is available, as well as how many questions had the tag on the last day. Also, compute the difference between these two values.

To do this, first compute the minimum date for each tag.

Then use the minimum dates to select the question_count on both the first and last day. To do this, join the temp table startdates to two different copies of the stackoverflow table: one for each column - first day and last day - aliased with different names.

### task 1
### Instruction
First, create a temporary table called startdates with each tag and the min() date for the tag in stackoverflow.

In [17]:
%%sql

-- To clear table if it already exists
DROP TABLE IF EXISTS startdates;

-- Create temp table syntax
CREATE TEMP TABLE startdates AS
-- Compute the minimum date for each what?
SELECT tag,
       min(date) AS mindate
  FROM stackoverflow
 -- What do you need to compute the min date for each tag?
 GROUP BY tag;
 
-- Look at the table you created
SELECT * 
  FROM startdates
 LIMIT 10; -- added to keep the displayed output short

 * postgresql://postgres:***@localhost:2828/datacamp
Done.
53 rows affected.
10 rows affected.


tag,mindate
amazon-route53,2016-01-01
google-spreadsheet,2016-01-01
dropbox,2016-01-01
amazon-data-pipeline,2016-09-01
amazon,2016-01-01
amazon-sns,2016-09-01
ios,2016-01-01
amazon-web-services,2016-01-01
amazon-cloudsearch,2016-01-01
amazon-ses,2016-09-01


### task 2
### Instruction
- Join startdates to stackoverflow twice using different table aliases.
- For each tag, select mindate, question_count on the mindate, and question_count on 2018-09-25 (the max date).
- Compute the change in question_count over time.
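
A rough Python sketch of the double join. The `mindate` dictionary plays the role of the startdates temp table, and the two lookups into `counts` correspond to the so_min and so_max aliases. Dates are assumed to be ISO-formatted strings so that string comparison orders them correctly; the function name and sample rows are hypothetical.

```python
def change_since_first_day(rows, last_date):
    """rows: iterable of (tag, date, question_count) with ISO date strings."""
    rows = list(rows)
    counts = {(tag, date): n for tag, date, n in rows}
    # Equivalent of the startdates temp table: earliest date per tag.
    mindate = {}
    for tag, date, _ in rows:
        if tag not in mindate or date < mindate[tag]:
            mindate[tag] = date
    result = []
    for tag, first in mindate.items():
        # Like an inner join: tags with no row on last_date are dropped.
        if (tag, last_date) not in counts:
            continue
        first_count = counts[(tag, first)]   # lookup via "so_min"
        last_count = counts[(tag, last_date)]  # lookup via "so_max"
        result.append((tag, first, first_count, last_count, last_count - first_count))
    return result


# Hypothetical rows; tag "z" has no entry on the last date.
sample = [
    ("x", "2016-01-01", 10), ("x", "2018-09-25", 25),
    ("y", "2017-03-18", 5), ("y", "2018-09-25", 9),
    ("z", "2016-09-01", 7),
]
changes = change_since_first_day(sample, "2018-09-25")
```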

In [18]:
%%sql

-- To clear table if it already exists
DROP TABLE IF EXISTS startdates;

CREATE TEMP TABLE startdates AS
SELECT tag, min(date) AS mindate
  FROM stackoverflow
 GROUP BY tag;
 
-- Select tag (Remember the table name!) and mindate
SELECT startdates.tag, 
       mindate, 
       -- Select question count on the min and max days
       so_min.question_count AS min_date_question_count,
       so_max.question_count AS max_date_question_count,
       -- Compute the change in question_count (max- min)
       so_max.question_count - so_min.question_count AS change
  FROM startdates
       -- Join startdates to stackoverflow with alias so_min
       INNER JOIN stackoverflow AS so_min
          -- What needs to match between tables?
          ON startdates.tag = so_min.tag
         AND startdates.mindate = so_min.date
       -- Join to stackoverflow again with alias so_max
       INNER JOIN stackoverflow AS so_max
          -- Again, what needs to match between tables?
          ON startdates.tag = so_max.tag
         AND so_max.date = '2018-09-25'
 LIMIT 10; -- added to keep the displayed output short

 * postgresql://postgres:***@localhost:2828/datacamp
Done.
53 rows affected.
10 rows affected.


tag,mindate,min_date_question_count,max_date_question_count,change
applepay,2017-03-18,222,357,135
applepayjs,2017-03-18,11,30,19
android-pay,2017-03-17,444,490,46
amazon-kinesis,2016-09-01,259,766,507
amazon-sns,2016-09-01,690,1400,710
amazon-emr,2016-09-01,557,3046,2489
amazon-swf,2016-09-01,167,232,65
amazon-ecs,2016-09-01,145,1074,929
amazon-rds,2016-09-01,1156,2537,1381
amazon-ses,2016-09-01,481,934,453


# 11. Insert into a temp table
### Exercises
While you can join the results of multiple similar queries together with UNION, sometimes it's easier to break a query down into steps. You can do this by creating a temporary table and inserting rows into it.

Compute the correlations between each pair of profits, profits_change, and revenues_change from the Fortune 500 data.

The resulting temporary table should have the following structure:

| measure         | profits | profits_change | revenues_change |
|-----------------|---------|----------------|-----------------|
| profits         | 1.00    | #              | #               |
| profits_change  | #       | 1.00           | #               |
| revenues_change | #       | #              | 1.00            |

Recall the round() function to make the results more readable:

`round(column_name::numeric, decimal_places)`

Note that Steps 1 and 2 do not produce output. It is normal for the query result pane to say "Your query did not generate any results."

### task 1
### Instruction
Create a temp table correlations.
- Compute the correlation between profits and each of the three variables (i.e. correlate profits with profits, profits with profits_change, etc).
- Alias columns by the name of the variable for which the correlation with profits is being computed.

In [19]:
%%sql

DROP TABLE IF EXISTS correlations;

-- Create temp table 
CREATE TEMP TABLE correlations AS
-- Select each correlation
SELECT 'profits'::varchar AS measure,
       -- Compute correlations
       corr(profits, profits) AS profits,
       corr(profits, profits_change) AS profits_change,
       corr(profits, revenues_change) AS revenues_change
  FROM fortune500;

 * postgresql://postgres:***@localhost:2828/datacamp
Done.
1 rows affected.


[]

### task 2
### Instruction
Insert rows into the correlations table for profits_change and revenues_change.

In [20]:
%%sql

DROP TABLE IF EXISTS correlations;

CREATE TEMP TABLE correlations AS
SELECT 'profits'::varchar AS measure,
       corr(profits, profits) AS profits,
       corr(profits, profits_change) AS profits_change,
       corr(profits, revenues_change) AS revenues_change
  FROM fortune500;

-- Add a row for profits_change
-- Insert into what table?
INSERT INTO correlations
-- Follow the pattern of the select statement above
-- Using profits_change instead of profits
SELECT 'profits_change'::varchar AS measure,
       corr(profits_change, profits) AS profits,
       corr(profits_change, profits_change) AS profits_change,
       corr(profits_change, revenues_change) AS revenues_change
  FROM fortune500;

-- Repeat the above, but for revenues_change
INSERT INTO correlations
SELECT 'revenues_change'::varchar AS measure,
       corr(revenues_change, profits) AS profits,
       corr(revenues_change, profits_change) AS profits_change,
       corr(revenues_change, revenues_change) AS revenues_change
  FROM fortune500;

 * postgresql://postgres:***@localhost:2828/datacamp
Done.
1 rows affected.
1 rows affected.
1 rows affected.


[]

### task 3
### Instruction
- Select all rows and columns from the correlations table to view the correlation matrix.
- First, you will need to round each correlation to 2 decimal places.
- The output of corr() is of type double precision, so you will need to also cast columns to numeric.
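
The row-by-row construction can be sketched in Python: one row per measure, one rounded correlation per column. This is an illustrative sketch with hypothetical data and helper names. Python's `round()` works directly on floats, so the cast that PostgreSQL needs for two-argument `round()` has no counterpart here.

```python
import math


def corr(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """columns: dict mapping a measure name to its list of values."""
    names = list(columns)
    table = []
    for measure in names:
        # One "INSERT" per measure: correlate it with every column.
        row = {"measure": measure}
        for other in names:
            row[other] = round(corr(columns[measure], columns[other]), 2)
        table.append(row)
    return table


# Hypothetical columns, not the fortune500 data.
matrix = correlation_matrix({
    "a": [1, 2, 3, 4],
    "b": [2, 1, 4, 3],
    "c": [4, 3, 2, 1],
})
```

The matrix is symmetric with 1.0 on the diagonal, which is a quick sanity check on the result shown in the output table.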

In [21]:
%%sql
DROP TABLE IF EXISTS correlations;

CREATE TEMP TABLE correlations AS
SELECT 'profits'::varchar AS measure,
       corr(profits, profits) AS profits,
       corr(profits, profits_change) AS profits_change,
       corr(profits, revenues_change) AS revenues_change
  FROM fortune500;

INSERT INTO correlations
SELECT 'profits_change'::varchar AS measure,
       corr(profits_change, profits) AS profits,
       corr(profits_change, profits_change) AS profits_change,
       corr(profits_change, revenues_change) AS revenues_change
  FROM fortune500;

INSERT INTO correlations
SELECT 'revenues_change'::varchar AS measure,
       corr(revenues_change, profits) AS profits,
       corr(revenues_change, profits_change) AS profits_change,
       corr(revenues_change, revenues_change) AS revenues_change
  FROM fortune500;

-- Select each column, rounding the correlations
SELECT measure, 
       round(profits::numeric, 2) AS profits,
       round(profits_change::numeric, 2) AS profits_change,
       round(revenues_change::numeric, 2) AS revenues_change
  FROM correlations;

 * postgresql://postgres:***@localhost:2828/datacamp
Done.
1 rows affected.
1 rows affected.
1 rows affected.
3 rows affected.


measure,profits,profits_change,revenues_change
profits,1.0,0.02,0.02
profits_change,0.02,1.0,-0.09
revenues_change,0.02,-0.09,1.0
