# SELECT from WORLD

In [1]:
import os
import pandas as pd
import findspark
os.environ['SPARK_HOME'] =  '/opt/spark'
findspark.init()

from pyspark.sql import SparkSession
sc = (SparkSession.builder.appName('app02')
      .config('spark.sql.warehouse.dir', 'hdfs://quickstart.cloudera:8020/user/hive/warehouse')
      .config('hive.metastore.uris', 'thrift://quickstart.cloudera:9083')
      .enableHiveSupport().getOrCreate())

In [2]:
world = sc.read.table('sqlzoo.world')

## 1. Introduction

[Read the notes about this table](https://sqlzoo.net/wiki/Read_the_notes_about_this_table.). Observe the result of running this SQL command to show the name, continent and population of all countries.

In [3]:
world.select('name', 'continent', 'population').toPandas()

Unnamed: 0,name,continent,population
0,Afghanistan,Asia,32225560.0
1,Albania,Europe,2845955.0
2,Algeria,Africa,43000000.0
3,Andorra,Europe,77543.0
4,Angola,Africa,31127674.0
...,...,...,...
190,Venezuela,South America,32219521.0
191,Vietnam,Asia,96208984.0
192,Yemen,Asia,29825968.0
193,Zambia,Africa,17885422.0
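
The prompt refers to an SQL command, while the cell above uses the DataFrame API. As a rough sketch of the SQL form, the snippet below runs the equivalent `SELECT` against a hypothetical in-memory SQLite table. The three rows are copied from the output above; the schema and the use of `sqlite3` are assumptions made only for illustration.

```python
import sqlite3

# Hypothetical stand-in for the `world` table (schema is an assumption);
# the three rows are copied from the notebook output.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE world (name TEXT, continent TEXT, population REAL)")
conn.executemany(
    "INSERT INTO world VALUES (?, ?, ?)",
    [
        ("Afghanistan", "Asia", 32225560.0),
        ("Albania", "Europe", 2845955.0),
        ("Algeria", "Africa", 43000000.0),
    ],
)

# SQL counterpart of world.select('name', 'continent', 'population')
rows = conn.execute("SELECT name, continent, population FROM world").fetchall()
for row in rows:
    print(row)
conn.close()
```

With the Spark session created in the first cell, the same statement could presumably be passed to `sc.sql(...)` with `sqlzoo.world` as the table name.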


## 2. Large Countries

[How to use WHERE to filter records](https://sqlzoo.net/wiki/WHERE_filters). Show the name of each country that has a population of at least 200 million. 200 million is 200000000 (eight zeros).

In [4]:
world.filter(world['population']>=2e8).select('name').toPandas()

Unnamed: 0,name
0,Brazil
1,China
2,India
3,Indonesia
4,Nigeria
5,Pakistan
6,United States


## 3. Per capita GDP

Give the `name` and the **per capita GDP** for those countries with a `population` of at least 200 million.

> _HELP: How to calculate per capita GDP_   
> Per capita GDP is the GDP divided by the population: GDP/population.

In [5]:
(world.withColumn('pcgdp', world['gdp']/world['population'])
    .filter(world['population']>=2e8)
    .select('name', 'pcgdp')
    .toPandas())

Unnamed: 0,name,pcgdp
0,Brazil,9721.370041
1,China,8724.30644
2,India,1891.781051
3,Indonesia,3804.772286
4,Nigeria,1822.886159
5,Pakistan,1377.036279
6,United States,59121.192067


## 4. South America In millions

Show the `name` and `population` in millions for the countries of the `continent` 'South America'. Divide the population by 1000000 to get population in millions.

In [6]:
(world.withColumn('popl', world['population']/1e6)
    .filter(world['continent']=='South America')
    .select('name', 'popl')
    .toPandas())

Unnamed: 0,name,popl
0,Argentina,44.938712
1,Bolivia,11.469896
2,Brazil,211.442625
3,Chile,19.107216
4,Colombia,49.395678
5,Ecuador,17.472948
6,Guyana,0.782766
7,Paraguay,7.252672
8,Peru,32.1314
9,Saint Vincent and the Grenadines,0.110608


## 5. France, Germany, Italy

Show the `name` and `population` for France, Germany and Italy.

In [7]:
(world.filter(world['name'].isin(['France', 'Germany', 'Italy']))
     .select('name', 'population')
     .toPandas())

Unnamed: 0,name,population
0,France,67076000.0
1,Germany,83149300.0
2,Italy,60238522.0


## 6. United

Show the countries which have a `name` that includes the word 'United'.

In [8]:
world.filter(world['name'].contains('United')).select('name').toPandas()

Unnamed: 0,name
0,United Arab Emirates
1,United Kingdom
2,United States


## 7. Two ways to be big

Two ways to be big: A country is **big** if it has an area of more than 3 million sq km or it has a population of more than 250 million.

**Show the countries that are big by area or big by population. Show name, population and area.**

In [9]:
(world.filter((world['area']>3e6) | (world['population']>2.5e8))
    .select('name', 'population', 'area')
    .toPandas())

Unnamed: 0,name,population,area
0,Australia,25690020.0,7692024.0
1,Brazil,211442600.0,8515767.0
2,Canada,38007170.0,9984670.0
3,China,1402379000.0,9596961.0
4,India,1361503000.0,3166414.0
5,Indonesia,266911900.0,1904569.0
6,Russia,146745100.0,17125242.0
7,United States,329583900.0,9826675.0


## 8. One or the other (but not both)

**Exclusive OR (XOR). Show the countries that are big by area (more than 3 million) or big by population (more than 250 million) but not both. Show name, population and area.**

- Australia has a big area but a small population, it should be **included**.
- Indonesia has a big population but a small area, it should be **included**.
- China has a big population **and** big area, it should be **excluded**.
- United Kingdom has a small population and a small area, it should be **excluded**.

In [10]:
(world.filter((world['area']>3e6) != (world['population']>2.5e8))
     .select('name', 'population', 'area')
     .toPandas())

Unnamed: 0,name,population,area
0,Australia,25690023.0,7692024.0
1,Brazil,211442625.0,8515767.0
2,Canada,38007166.0,9984670.0
3,Indonesia,266911900.0,1904569.0
4,Russia,146745098.0,17125242.0
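
The filter above relies on `!=` between two boolean conditions behaving as exclusive OR. A minimal plain-Python sketch of that logic, applied to the four countries named in the prompt, is shown below. The first three rows use figures from the outputs above; the United Kingdom figures are rough illustrative values (an assumption), and `big_by_exactly_one` is a hypothetical helper name.

```python
# (name, population, area). The United Kingdom row uses rough
# illustrative figures (assumption); the others come from the outputs above.
countries = [
    ("Australia", 25690023, 7692024),
    ("Indonesia", 266911900, 1904569),
    ("China", 1402379000, 9596961),
    ("United Kingdom", 67000000, 242000),
]

def big_by_exactly_one(population, area):
    big_area = area > 3e6
    big_population = population > 2.5e8
    # For two booleans, `!=` is true exactly when one of them is true (XOR).
    return big_area != big_population

included = [name for name, pop, area in countries
            if big_by_exactly_one(pop, area)]
print(included)  # ['Australia', 'Indonesia']
```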


## 9. Rounding

Show the `name` and `population` in millions and the GDP in billions for the countries of the `continent` 'South America'. Use the [ROUND](https://sqlzoo.net/wiki/ROUND) function to show the values to two decimal places.

**For South America show population in millions and GDP in billions both to 2 decimal places.**

> _Millions and billions_    
> Divide by 1000000 (6 zeros) for millions. Divide by 1000000000 (9 zeros) for billions.

In [11]:
from pyspark.sql.functions import round  # note: shadows Python's built-in round() from here on
(world.filter(world['continent']=='South America')
    .withColumn('popl', round(world['population']/1e6, 2))
    .withColumn('gdp_', round(world['gdp']/1e9, 2))
    .select('name', 'popl', 'gdp_')
    .toPandas())

Unnamed: 0,name,popl,gdp_
0,Argentina,44.94,637.49
1,Bolivia,11.47,37.51
2,Brazil,211.44,2055.51
3,Chile,19.11,277.08
4,Colombia,49.4,309.19
5,Ecuador,17.47,104.3
6,Guyana,0.78,3.09
7,Paraguay,7.25,29.44
8,Peru,32.13,211.4
9,Saint Vincent and the Grenadines,0.11,0.73
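
One detail worth knowing when comparing results with plain Python: Spark's `round` is documented as using HALF_UP rounding, whereas Python's built-in `round` rounds ties to the nearest even digit. The sketch below illustrates the two modes with the standard `decimal` module. It does not call Spark, and the value `0.125` is just an illustrative tie case.

```python
import builtins
from decimal import Decimal, ROUND_HALF_UP, ROUND_HALF_EVEN

value = Decimal("0.125")       # exactly halfway between 0.12 and 0.13
two_places = Decimal("0.01")

half_up = value.quantize(two_places, rounding=ROUND_HALF_UP)
half_even = value.quantize(two_places, rounding=ROUND_HALF_EVEN)

# builtins.round is used explicitly because the notebook imports
# Spark's round under the same name.
builtin_result = builtins.round(0.125, 2)

print(half_up)         # 0.13
print(half_even)       # 0.12
print(builtin_result)  # 0.12
```

For ties like this one, a Spark column rounded to two places would be expected to follow the HALF_UP result; `pyspark.sql.functions.bround` is the HALF_EVEN counterpart.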


## 10. Trillion dollar economies

Show the `name` and per-capita GDP for those countries with a GDP of at least one trillion (1000000000000; that is 12 zeros). Round this value to the nearest 1000.

**Show per-capita GDP for the trillion dollar countries to the nearest $1000.**

In [12]:
(world.withColumn('pcgdp', round(world['gdp']/(1000*world['population']), 0)*1000)
    .filter(world['gdp']>=1e12)
    .select('name', 'pcgdp')
    .toPandas())

Unnamed: 0,name,pcgdp
0,Australia,55000.0
1,Brazil,10000.0
2,Canada,43000.0
3,China,9000.0
4,France,39000.0
5,Germany,44000.0
6,India,2000.0
7,Indonesia,4000.0
8,Italy,32000.0
9,Japan,39000.0
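
The cell above rounds to the nearest 1000 by scaling down, rounding to a whole number, and scaling back up. A small plain-Python sketch of the same arithmetic follows, using per-capita GDP figures copied from the Section 3 output; `to_nearest_thousand` is a hypothetical helper name.

```python
import builtins  # the notebook imports Spark's round under the same name

# Per-capita GDP figures copied from the Section 3 output.
pcgdp = {
    "Brazil": 9721.370041,
    "China": 8724.30644,
    "India": 1891.781051,
    "United States": 59121.192067,
}

def to_nearest_thousand(value):
    # Scale down, round to a whole number, scale back up.
    return builtins.round(value / 1000) * 1000

rounded = {name: to_nearest_thousand(v) for name, v in pcgdp.items()}
print(rounded)
# {'Brazil': 10000, 'China': 9000, 'India': 2000, 'United States': 59000}
```

The Brazil, China and India results agree with the rows shown above. Spark's `round` is also documented as accepting a negative scale, so `round(column, -3)` should be an equivalent way to express this step.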


## 11. Name and capital have the same length

Greece has capital Athens.

Each of the strings 'Greece' and 'Athens' has 6 characters.

**Show the name and capital where the name and the capital have the same number of characters.**

- You can use the [LENGTH](https://sqlzoo.net/wiki/LENGTH) function to find the number of characters in a string

In [13]:
from pyspark.sql.functions import length
(world.filter(length(world['name'])==length(world['capital']))
    .select('name', 'capital')
    .toPandas())

Unnamed: 0,name,capital
0,Algeria,Algiers
1,Angola,Luanda
2,Armenia,Yerevan
3,Botswana,Gaborone
4,Canada,Ottowa
5,Djibouti,Djibouti
6,Egypt,Cairo
7,Estonia,Tallinn
8,Fiji,Suva
9,Gambia,Banjul


## 12. Matching name and capital

The capital of Sweden is Stockholm. Both words start with the letter 'S'.

**Show the name and the capital where the first letters of each match. Don't include countries where the name and the capital are the same word.**

- You can use the function [LEFT](https://sqlzoo.net/wiki/LEFT) to isolate the first character.
- You can use <> as the **NOT EQUALS** operator.

In [14]:
from pyspark.sql.functions import substring
(world.filter((substring(world['name'], 1, 1)==substring(world['capital'], 1, 1)) &
              (world['name']!=world['capital']))  # exclude identical name and capital
    .select('name', 'capital')
    .toPandas())

Unnamed: 0,name,capital
0,Algeria,Algiers
1,Andorra,Andorra la Vella
2,Barbados,Bridgetown
3,Belize,Belmopan
4,Brazil,Brasília
5,Brunei,Bandar Seri Begawan
6,Burundi,Bujumbura
7,Guatemala,Guatemala City
8,Guyana,Georgetown
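
The prompt combines two conditions: the first letters must match, and the name and capital must not be the same word. A minimal plain-Python sketch of that predicate is given below. The pairs are taken from this notebook's prompts and outputs (Djibouti and Greece appear in Section 11), and `first_letters_match` is a hypothetical helper name.

```python
# (name, capital) pairs taken from the notebook's prompts and outputs.
pairs = [
    ("Sweden", "Stockholm"),
    ("Algeria", "Algiers"),
    ("Djibouti", "Djibouti"),   # same word, so it should be excluded
    ("Greece", "Athens"),       # different first letters
]

def first_letters_match(name, capital):
    # Same first letter, but not the same word.
    return name[0] == capital[0] and name != capital

selected = [(n, c) for n, c in pairs if first_letters_match(n, c)]
print(selected)  # [('Sweden', 'Stockholm'), ('Algeria', 'Algiers')]
```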


## 13. All the vowels

**Equatorial Guinea** and **Dominican Republic** have all of the vowels (a e i o u) in the name. They don't count because they have more than one word in the name.

**Find the country that has all the vowels and no spaces in its name.**

- You can use the phrase name `NOT LIKE '%a%'` to exclude characters from your results.
- The query shown misses countries like Bahamas and Belarus because they contain at least one 'a'

In [15]:
(world.filter(world['name'].rlike('[Aa]') &
              world['name'].rlike('[Ee]') &
              world['name'].rlike('[Ii]') &
              world['name'].rlike('[Oo]') &
              world['name'].rlike('[Uu]') &
              world['name'].rlike(r'^\S+$'))
    .select('name')
    .toPandas())

Unnamed: 0,name
0,Mozambique
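
The six `rlike` conditions amount to "contains every vowel, in either case, and no whitespace". As a quick sanity check outside Spark, the same rule can be sketched in plain Python with a set comparison. The sample names come from the prompt and the output above; `has_all_vowels_one_word` is a hypothetical helper name.

```python
VOWELS = set("aeiou")

def has_all_vowels_one_word(name):
    # Every vowel occurs at least once (case-insensitive)
    # and the name contains no whitespace.
    return VOWELS <= set(name.lower()) and not any(ch.isspace() for ch in name)

names = ["Mozambique", "Equatorial Guinea", "Dominican Republic", "Bahamas"]
result = [n for n in names if has_all_vowels_one_word(n)]
print(result)  # ['Mozambique']
```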


In [16]:
sc.stop()