In [1]:
import pandas as pd
import sqlalchemy as sa
import psycopg2 as ps
from sqlalchemy import create_engine

In [2]:
%load_ext sql
%sql postgresql://postgres:lingga28@localhost:2828/datacamp
conn = create_engine('postgresql://postgres:lingga28@localhost:2828/datacamp')

# 1. Explore table sizes
### Exercises
Let's start by exploring five related tables:

- stackoverflow: questions asked on Stack Overflow with certain tags
- company: information on companies related to tags in stackoverflow
- tag_company: links stackoverflow to company
- tag_type: type categories applied to tags in stackoverflow
- fortune500: information on top US companies

Count the number of rows in a table with

SELECT count(*) \
  FROM tablename;

Count the number of columns in a table by selecting a few rows and manually counting the columns in the result.

Which table has the most rows? Which table has the most columns?
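
As a rough illustration of the two counts described above, here is a minimal Python sketch. It assumes an in-memory SQLite database with a made-up three-row `tag_type` table rather than the course's PostgreSQL data, so the numbers are not the real answer to the exercise.

```python
import sqlite3

# Hypothetical toy data; the real tables live in the PostgreSQL database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tag_type (id INTEGER, tag TEXT, type TEXT)")
db.executemany(
    "INSERT INTO tag_type VALUES (?, ?, ?)",
    [(1, "amazon-ec2", "cloud"), (2, "excel", "spreadsheet"), (3, "ios", "mobile-os")],
)

# Rows: count(*) counts every row, whether or not it contains NULLs.
n_rows = db.execute("SELECT count(*) FROM tag_type").fetchone()[0]

# Columns: the cursor description has one entry per column in the result.
cursor = db.execute("SELECT * FROM tag_type LIMIT 1")
n_cols = len(cursor.description)

print(n_rows, n_cols)
```

Reading the column count from the cursor avoids counting by hand, which helps for wide tables such as fortune500.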

### Instructions
### Possible Answers:
- A. stackoverflow has the most rows; company has the most columns
- B. tag_company has the most rows; company has the most columns
- C. stackoverflow has the most rows; fortune500 has the most columns
- D. tag_type has the most rows; fortune500 has the most columns

Answer: C

# 2. Count missing values
### Exercises
Which column of fortune500 has the most missing values? To find out, you'll need to check each column individually, although here we'll check just three.
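
The technique used in the tasks below relies on `count(*)` counting all rows while `count(column)` skips NULLs, so their difference is the number of missing values. A minimal sketch, assuming an in-memory SQLite table named `fortune500_toy` with invented rows:

```python
import sqlite3

# Hypothetical toy table; two of the four rows have no ticker.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE fortune500_toy (title TEXT, ticker TEXT)")
db.executemany(
    "INSERT INTO fortune500_toy VALUES (?, ?)",
    [("Apple", "AAPL"), ("Private Co", None), ("Microsoft", "MSFT"), ("Other Co", None)],
)

# count(*) counts rows; count(ticker) counts only non-NULL ticker values.
total, non_null, missing = db.execute(
    "SELECT count(*), count(ticker), count(*) - count(ticker) FROM fortune500_toy"
).fetchone()

print(total, non_null, missing)  # 4 2 2
```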

### task 1
### Instruction
First, figure out how many rows are in fortune500 by counting them.

In [3]:
%%sql

-- Select the count of the number of rows
SELECT COUNT(*)
  FROM fortune500;

 * postgresql://postgres:***@localhost:2828/datacamp
1 rows affected.


count
500


### task 2
### Instruction
Subtract the count of the non-NULL ticker values from the total number of rows; alias the difference as missing.

In [4]:
%%sql

-- Select the count of ticker, 
-- subtract from the total number of rows, 
-- and alias as missing
SELECT count(*) - count(ticker) AS missing
  FROM fortune500;

 * postgresql://postgres:***@localhost:2828/datacamp
1 rows affected.


missing
32


### task 3
### Instruction
Repeat for the profits_change column.

In [5]:
%%sql

-- Select the count of profits_change, 
-- subtract from total number of rows, and alias as missing
SELECT count(*) - count(profits_change) as missing
FROM fortune500;

 * postgresql://postgres:***@localhost:2828/datacamp
1 rows affected.


missing
63


### task 4
### Instruction
Repeat for the industry column.

In [6]:
%%sql

-- Select the count of industry, 
-- subtract from total number of rows, and alias as missing
SELECT COUNT(*) - COUNT(industry) as missing
FROM fortune500;

 * postgresql://postgres:***@localhost:2828/datacamp
1 rows affected.


missing
13


# 3. Join tables
### Exercises
Part of exploring a database is figuring out how tables relate to each other. The company and fortune500 tables don't have a formal relationship between them in the database, but this doesn't prevent you from joining them.

To join the tables, you need to find a column that they have in common where the values are consistent across the tables. Remember: just because two tables have a column with the same name, it doesn't mean those columns necessarily contain compatible data. If you find more than one pair of columns with similar data, you may need to try joining with each in turn to see if you get the same number of results.
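
One way to compare candidate join columns is to count the rows each join produces. The sketch below assumes an in-memory SQLite database with invented rows, and `count_matches` is a hypothetical helper, not part of the course material. It shows that the name columns do not line up even though the ticker columns do.

```python
import sqlite3

# Hypothetical toy versions of the two tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE company (name TEXT, ticker TEXT);
    CREATE TABLE fortune500 (title TEXT, ticker TEXT);
    INSERT INTO company VALUES ('Apple Incorporated', 'AAPL'), ('Microsoft Corp.', 'MSFT');
    INSERT INTO fortune500 VALUES ('Apple', 'AAPL'), ('Microsoft', 'MSFT'), ('Walmart', 'WMT');
""")


def count_matches(left_col, right_col):
    """Count the rows returned by an inner join on the given column pair."""
    # Column names come from the fixed calls below, never from user input.
    query = (
        "SELECT count(*) FROM company "
        f"INNER JOIN fortune500 ON company.{left_col} = fortune500.{right_col}"
    )
    return db.execute(query).fetchone()[0]


by_name = count_matches("name", "title")        # names are spelled differently
by_ticker = count_matches("ticker", "ticker")   # tickers are consistent

print(by_name, by_ticker)  # 0 2
```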

Reference the entity relationship diagram if needed.

### Instructions
- Look at the contents of the company and fortune500 tables. Find a column that they have in common where the values for each company are the same in both tables.
- Join the company and fortune500 tables with an INNER JOIN.
- Select only company.name for companies that appear in both tables.

In [7]:
%%sql

SELECT company.name
-- Table(s) to select from
  FROM company
       INNER JOIN fortune500
       ON company.ticker = fortune500.ticker;

 * postgresql://postgres:***@localhost:2828/datacamp
8 rows affected.


name
Apple Incorporated
Amazon.com Inc
Alphabet
Microsoft Corp.
International Business Machines Corporation
PayPal Holdings Incorporated
"eBay, Inc."
Adobe Systems Incorporated


# 4. Foreign keys
### Exercises
Recall that foreign keys reference another row in the database via a unique ID. Values in a foreign key column are restricted to values in the referenced column OR NULL.
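
The restriction can be seen in a small sketch. It assumes an in-memory SQLite database, where foreign key enforcement has to be switched on explicitly, and hypothetical tables `tag` and `tag_type_toy`. PostgreSQL enforces foreign keys by default, and its error type differs.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when enabled

# The referenced column is a primary key, so its values are unique.
db.execute("CREATE TABLE tag (tag TEXT PRIMARY KEY)")
db.execute("CREATE TABLE tag_type_toy (tag TEXT REFERENCES tag (tag), type TEXT)")
db.execute("INSERT INTO tag VALUES ('amazon-ec2')")

# Allowed: the value exists in the referenced column.
db.execute("INSERT INTO tag_type_toy VALUES ('amazon-ec2', 'cloud')")
# Allowed: NULL is always permitted in a foreign key column.
db.execute("INSERT INTO tag_type_toy VALUES (NULL, 'cloud')")

# Rejected: 'no-such-tag' is not present in tag.tag.
try:
    db.execute("INSERT INTO tag_type_toy VALUES ('no-such-tag', 'cloud')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = db.execute("SELECT count(*) FROM tag_type_toy").fetchone()[0]
print(rejected, stored)  # True 2
```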

Using what you know about foreign keys, why can't the tag column in the tag_type table be a foreign key that references the tag column in the stackoverflow table?

### Instructions
### Possible Answers:
- A. stackoverflow.tag is not a primary key
- B. tag_type.tag contains NULL values
- C. stackoverflow.tag contains duplicate values
- D. tag_type.tag does not contain all the values in stackoverflow.tag

Answer: C

# 5. Read an entity relationship diagram
### Exercises
The information you need is sometimes split across multiple tables in the database.

What is the most common stackoverflow tag_type? What companies have a tag of that type?

To generate a list of such companies, you'll need to join three tables together.

Reference the entity relationship diagram as needed when determining which columns to use when joining tables.

### task 1
### Instructions
- First, using the tag_type table, count the number of tags with each type.
- Order the results to find the most common tag type.
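
A minimal sketch of the count-and-order pattern, assuming an in-memory SQLite table with invented tags. Sorting in descending order puts the most common type in the first row.

```python
import sqlite3

# Hypothetical toy data: three cloud tags, two database tags, one mobile-os tag.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE tag_type (tag TEXT, type TEXT);
    INSERT INTO tag_type VALUES
        ('amazon-ec2', 'cloud'), ('azure', 'cloud'), ('amazon-s3', 'cloud'),
        ('mysql', 'database'), ('postgresql', 'database'),
        ('ios', 'mobile-os');
""")

# GROUP BY produces one row per type; ORDER BY ... DESC lists the largest count first.
counts = db.execute("""
    SELECT type, count(*) AS n
      FROM tag_type
     GROUP BY type
     ORDER BY n DESC
""").fetchall()

most_common = counts[0][0]
print(counts)
print(most_common)  # cloud
```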

In [8]:
%%sql

-- Count the number of tags with each type
SELECT type, COUNT(*) as count
  FROM tag_type
 -- To get the count for each type, what do you need to do?
 GROUP BY type
 -- Order the results by count; the sort is ascending,
 -- so the most common tag type is listed last
 ORDER BY count(type);

 * postgresql://postgres:***@localhost:2828/datacamp
10 rows affected.


type,count
identity,1
os,2
storage,2
spreadsheet,2
company,4
api,4
mobile-os,4
payment,5
database,6
cloud,31


### task 2
### Instruction
- Join the tag_company, company, and tag_type tables, keeping only mutually occurring records.
- Select company.name, tag_type.tag, and tag_type.type for tags with the most common type from the previous step.
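
The three-table join can be sketched on toy data. The example assumes an in-memory SQLite database with invented rows. Only the column names used in the join conditions (`company.id`, `tag_company.company_id`, `tag_company.tag`, `tag_type.tag`) follow the exercise.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tag_company (tag TEXT, company_id INTEGER);
    CREATE TABLE tag_type (tag TEXT, type TEXT);
    INSERT INTO company VALUES (1, 'Amazon Web Services'), (2, 'Microsoft Corp.');
    INSERT INTO tag_company VALUES ('amazon-ec2', 1), ('excel', 2), ('untyped-tag', 2);
    INSERT INTO tag_type VALUES ('amazon-ec2', 'cloud'), ('excel', 'spreadsheet');
""")

# Inner joins keep only rows with a match in all three tables,
# so 'untyped-tag' (no row in tag_type) drops out before the filter runs.
rows = db.execute("""
    SELECT company.name, tag_type.tag, tag_type.type
      FROM company
           INNER JOIN tag_company ON company.id = tag_company.company_id
           INNER JOIN tag_type ON tag_company.tag = tag_type.tag
     WHERE tag_type.type = 'cloud'
""").fetchall()

print(rows)  # [('Amazon Web Services', 'amazon-ec2', 'cloud')]
```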

In [9]:
%%sql

-- Select the 3 columns desired
SELECT company.name, tag_type.tag, tag_type.type
  FROM company
       -- Join to the tag_company table
       INNER JOIN tag_company 
       ON company.id = tag_company.company_id
       -- Join to the tag_type table
       INNER JOIN tag_type
       ON tag_company.tag = tag_type.tag
  -- Filter to most common type
  WHERE type='cloud'
  LIMIT 10; -- added only to keep the output short

 * postgresql://postgres:***@localhost:2828/datacamp
10 rows affected.


name,tag,type
Amazon Web Services,amazon-cloudformation,cloud
Amazon Web Services,amazon-cloudfront,cloud
Amazon Web Services,amazon-cloudsearch,cloud
Amazon Web Services,amazon-cloudwatch,cloud
Amazon Web Services,amazon-cognito,cloud
Amazon Web Services,amazon-data-pipeline,cloud
Amazon Web Services,amazon-dynamodb,cloud
Amazon Web Services,amazon-ebs,cloud
Amazon Web Services,amazon-ec2,cloud
Amazon Web Services,amazon-ecs,cloud


# 6. Coalesce
### Exercises
The coalesce() function can be useful for specifying a default or backup value when a column contains NULL values.

coalesce() checks arguments in order and returns the first non-NULL value, if one exists.

- coalesce(NULL, 1, 2) = 1
- coalesce(NULL, NULL) = NULL
- coalesce(2, 3, NULL) = 2

In the fortune500 data, industry contains some missing values. Use coalesce() to use the value of sector as the industry when industry is NULL. Then find the most common industry.
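
The three coalesce() examples and the fallback pattern can be checked with a short sketch. It assumes an in-memory SQLite database, whose coalesce() behaves the same way for these inputs, and a hypothetical `fortune500_toy` table with invented rows.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# SQL NULL comes back as Python None.
examples = db.execute(
    "SELECT coalesce(NULL, 1, 2), coalesce(NULL, NULL), coalesce(2, 3, NULL)"
).fetchone()
print(examples)  # (1, None, 2)

# Fallback chain: industry first, then sector, then the literal 'Unknown'.
db.executescript("""
    CREATE TABLE fortune500_toy (industry TEXT, sector TEXT);
    INSERT INTO fortune500_toy VALUES
        ('Airlines', 'Transportation'), (NULL, 'Energy'), (NULL, NULL);
""")
industry2 = [
    row[0]
    for row in db.execute(
        "SELECT coalesce(industry, sector, 'Unknown') FROM fortune500_toy ORDER BY rowid"
    )
]
print(industry2)  # ['Airlines', 'Energy', 'Unknown']
```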

### Instructions
- Use coalesce() to select the first non-NULL value from industry, sector, or 'Unknown' as a fallback value.
- Alias the result of the call to coalesce() as industry2.
- Count the number of rows with each industry2 value.
- Find the most common value of industry2.

In [10]:
%%sql

-- Use coalesce
SELECT coalesce(industry, sector, 'Unknown') AS industry2,
       -- Don't forget to count!
       COUNT(*)
  FROM fortune500 
-- Group by what? (What are you counting by?)
 GROUP BY industry2
-- Order results to see most common first
 ORDER BY COUNT DESC
-- Limit results to get just the one value you want
 LIMIT 1;

 * postgresql://postgres:***@localhost:2828/datacamp
1 rows affected.


industry2,count
Utilities: Gas and Electric,22


# 7. Coalesce with a self-join
### Exercises
You previously joined the company and fortune500 tables to find out which companies are in both tables. Now, also include companies from the company table that are subsidiaries of Fortune 500 companies.

To include subsidiaries, you will need to join company to itself to associate a subsidiary with its parent company's information. To do this self-join, use two different aliases for company.

coalesce will help you combine the two ticker columns in the result of the self-join to join to fortune500.
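
Here is a small sketch of the self-join on toy data. It assumes an in-memory SQLite database with invented companies: a ranked parent, a subsidiary with no ticker of its own, and an unranked company. The column names follow the exercise.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT, ticker TEXT, parent_id INTEGER);
    CREATE TABLE fortune500 (title TEXT, ticker TEXT, rank INTEGER);
    INSERT INTO company VALUES
        (1, 'Parent Inc', 'PRNT', NULL),
        (2, 'Child LLC', NULL, 1),
        (3, 'Unranked Co', 'UNRK', NULL);
    INSERT INTO fortune500 VALUES ('Parent', 'PRNT', 5);
""")

# The LEFT JOIN keeps companies without a parent (their parent columns are NULL).
# coalesce then falls back to the company's own ticker for the join to fortune500.
rows = db.execute("""
    SELECT company_original.name, title, rank
      FROM company AS company_original
           LEFT JOIN company AS company_parent
           ON company_original.parent_id = company_parent.id
           INNER JOIN fortune500
           ON coalesce(company_parent.ticker, company_original.ticker) = fortune500.ticker
     ORDER BY rank, company_original.id
""").fetchall()

print(rows)  # [('Parent Inc', 'Parent', 5), ('Child LLC', 'Parent', 5)]
```

The subsidiary picks up its parent's Fortune 500 entry, while the unranked company is dropped by the inner join.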

### Instructions
- Join company to itself to add information about a company's parent to the original company's information.
- Use coalesce to get the parent company ticker if available and the original company ticker otherwise.
- INNER JOIN to fortune500 using the ticker.
- Select original company name, fortune500 title and rank.

In [11]:
%%sql

SELECT company_original.name, title, rank
  -- Start with original company information
  FROM company AS company_original
       -- Join to another copy of company with parent
       -- company information
       LEFT JOIN company AS company_parent
       ON company_original.parent_id = company_parent.id 
       -- Join to fortune500, only keep rows that match
       INNER JOIN fortune500 
       -- Use parent ticker if there is one, 
       -- otherwise original ticker
       ON coalesce(company_parent.ticker, 
                   company_original.ticker) = 
             fortune500.ticker
 -- For clarity, order by rank
 ORDER BY rank;

 * postgresql://postgres:***@localhost:2828/datacamp
10 rows affected.


name,title,rank
Apple Incorporated,Apple,3
Amazon.com Inc,Amazon.com,12
Amazon Web Services,Amazon.com,12
Alphabet,Alphabet,27
Google LLC,Alphabet,27
Microsoft Corp.,Microsoft,28
International Business Machines Corporation,IBM,32
PayPal Holdings Incorporated,PayPal Holdings,264
"eBay, Inc.",eBay,310
Adobe Systems Incorporated,Adobe Systems,443


# 8. Effects of casting
When you cast data from one type to another, information can be lost or changed. See how the casting changes values and practice casting data using the CAST() function and the :: syntax.

SELECT CAST(value AS new_type);

SELECT value::new_type;

### task 1
### Instruction
- Select profits_change and profits_change cast as integer from fortune500.
- Look at how the values were converted.

In [13]:
%%sql

-- Select the original value
SELECT profits_change, 
       -- Cast profits_change
       CAST(profits_change AS integer) AS profits_change_int
  FROM fortune500
 LIMIT 10; -- added only to keep the output short

 * postgresql://postgres:***@localhost:2828/datacamp
10 rows affected.


profits_change,profits_change_int
-7.2,-7
0.0,0
-14.4,-14
-51.5,-52
53.0,53
20.7,21
1.5,2
-2.7,-3
-2.8,-3
-37.7,-38
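
The output above shows that this cast rounds to the nearest integer rather than truncating: -51.5 becomes -52 and 1.5 becomes 2. The Python sketch below contrasts the two behaviours on a few of those values. It assumes ties round away from zero, which is chosen only because it reproduces the values shown and is not a statement about how PostgreSQL rounds every type.

```python
from decimal import Decimal, ROUND_HALF_UP

values = [Decimal("-7.2"), Decimal("-51.5"), Decimal("1.5"), Decimal("-2.7")]

# Rounding to the nearest integer, with ties going away from zero.
rounded = [int(v.quantize(Decimal("1"), rounding=ROUND_HALF_UP)) for v in values]

# Truncation simply drops the fractional part.
truncated = [int(v) for v in values]

print(rounded)    # [-7, -52, 2, -3]
print(truncated)  # [-7, -51, 1, -2]
```

Either way, the fractional part is lost, so casting to integer is not reversible.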


### task 2
### Instruction
Compare the result of dividing the integer value 10 by 3 with the result of dividing the numeric value 10 by 3.

In [14]:
%%sql

-- Divide 10 by 3
SELECT 10/3, 
       -- Cast 10 as numeric and divide by 3
       10::numeric/3;

 * postgresql://postgres:***@localhost:2828/datacamp
1 rows affected.


?column?,?column?_1
3,3.333333333333333
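
The same contrast can be shown in Python as a loose analogy. It assumes `//` stands in for SQL integer division (the two agree for positive operands) and `Decimal` stands in for the numeric type. The number of digits printed depends on Python's decimal context, not on PostgreSQL.

```python
from decimal import Decimal

# Integer division discards the remainder.
int_result = 10 // 3

# Decimal division keeps the fractional part up to the context precision.
num_result = Decimal(10) / Decimal(3)

print(int_result)  # 3
print(num_result)
```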


### task 3
### Instruction
- Now cast numbers that appear as text as numeric.
- Note: 1e3 is scientific notation.

In [15]:
%%sql

SELECT '3.2'::numeric,
       '-123'::numeric,
       '1e3'::numeric,
       '1e-3'::numeric,
       '02314'::numeric,
       '0002'::numeric;

 * postgresql://postgres:***@localhost:2828/datacamp
1 rows affected.


numeric,numeric_1,numeric_2,numeric_3,numeric_4,numeric_5
3.2,-123,1000,0.001,2314,2
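
Python's `Decimal` parses the same strings in a comparable way, including scientific notation and leading zeros. This is only an analogy, since the parsing rules of PostgreSQL's numeric type are defined separately.

```python
from decimal import Decimal

texts = ["3.2", "-123", "1e3", "1e-3", "02314", "0002"]

# Each string is converted to an exact decimal value.
parsed = [Decimal(t) for t in texts]

for text, value in zip(texts, parsed):
    # The 'f' format prints plain decimal notation instead of exponent form.
    print(text, "->", format(value, "f"))
```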


# 9. Summarize the distribution of numeric values
### Exercises
Was 2017 a good or bad year for revenue of Fortune 500 companies? Examine how revenue changed from 2016 to 2017 by first looking at the distribution of revenues_change and then counting companies whose revenue increased.

### task 1
### Instructions
- Use GROUP BY and count() to examine the values of revenues_change.
- Order the results by revenues_change to see the distribution.

In [17]:
%%sql

-- Select the count of each value of revenues_change
SELECT COUNT(revenues_change), revenues_change
  FROM fortune500
 GROUP BY revenues_change
 -- order by the values of revenues_change
 ORDER BY revenues_change
 LIMIT 10; -- added only to keep the output short

 * postgresql://postgres:***@localhost:2828/datacamp
10 rows affected.


count,revenues_change
1,-57.5
1,-53.3
1,-51.4
1,-50.9
1,-45.0
1,-41.7
1,-38.7
1,-38.3
1,-37.5
1,-32.8


### task 2
### Instruction
Repeat step 1, but this time, cast revenues_change as an integer to reduce the number of different values.
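
To see why casting reduces the number of groups, consider a small sketch on invented values. It assumes nearest-integer rounding with ties away from zero as a stand-in for the SQL cast. Nearby values fall into the same integer bin, so counts per bin grow.

```python
from collections import Counter
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical revenue changes, each distinct before rounding.
changes = [Decimal("-51.4"), Decimal("-50.9"), Decimal("-38.7"),
           Decimal("-38.3"), Decimal("2.1")]

# Grouping by the exact value gives one group per value.
exact = Counter(changes)

# Grouping by the rounded value merges neighbours into shared bins.
binned = Counter(
    int(v.quantize(Decimal("1"), rounding=ROUND_HALF_UP)) for v in changes
)

print(len(exact), len(binned))  # 5 4
print(sorted(binned.items()))   # [(-51, 2), (-39, 1), (-38, 1), (2, 1)]
```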

In [18]:
%%sql

-- Select the count of each revenues_change integer value
SELECT revenues_change::integer, COUNT(revenues_change)
  FROM fortune500
 GROUP BY revenues_change::integer
 -- order by the values of revenues_change
 ORDER BY revenues_change
 LIMIT 10; -- added only to keep the output short

 * postgresql://postgres:***@localhost:2828/datacamp
10 rows affected.


revenues_change,count
-58,1
-53,1
-51,2
-45,1
-42,1
-39,1
-38,2
-33,1
-30,1
-27,1


### task 3
### Instruction
How many of the Fortune 500 companies had revenues increase in 2017 compared to 2016? To find out, count the rows of fortune500 where revenues_change indicates an increase.

In [20]:
%%sql

-- Count rows 
SELECT count(*)
  FROM fortune500
 -- Where...
 WHERE revenues_change > 0;

 * postgresql://postgres:***@localhost:2828/datacamp
1 rows affected.


count
298
