## Problem 06

Weightage: 25

Get cities with top ten female member count from each state. There is a chance that more than 1 city might get the same rank if the counts are same. You need to get all the cities which contain top ten female member count from each state.

## Data Description

All of the address data is available under **/public/addresses**. Here is the schema.
```
root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- postal_code: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- street: string (nullable = true)
 |-- email: string (nullable = true)
 |-- first_name: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- id: long (nullable = true)
 |-- ip_address: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- phone_numbers: array (nullable = true)
 |    |-- element: string (containsNull = true)
```

## Output Requirements
* Place the result in the HDFS Directory 
```
/user/`whoami`/mock_test_02/problem06/solution
```
* Use CSV and save the output to exactly one file. Make sure to preserve the header.
* Here are the column names. Data types should be same as input data.
```
 |-- state: string
 |-- city:string
 |-- female_count: long
```
* Data should be sorted in ascending order by state and then in descending order by count.

## Validation

Here are the self validation steps:
* Run the following to check number of files.
```
hdfs dfs -ls /user/`whoami`/mock_test_02/problem06/solution
```
* Run the following to validate the data. Review the data to see if it is sorted in ascending order by state and then in descending order by count.
```
hdfs dfs -cat /user/`whoami`/mock_test_02/problem06/solution/part*
```
* Run this command to get the count including header. Result should be 320.
```
hdfs dfs -cat /user/`whoami`/mock_test_02/problem06/solution/part*|wc -l
```


In [1]:
from pyspark.sql import SparkSession
import getpass
username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    appName(f'Problem 06 | {username}'). \
    master('yarn'). \
    getOrCreate()

In [2]:
from pyspark.sql.functions import col,lit,count
data = spark.read.json("/public/addresses"). \
filter(col("gender")=='Female'). \
select("id",col('address.*')). \
groupBy("city","state"). \
agg(count(lit(1)).alias('female_count')). \
orderBy(col("state"),col("female_count").desc())

In [3]:
data

city,state,female_count
Birmingham,Alabama,3980
Montgomery,Alabama,2132
Mobile,Alabama,2093
Huntsville,Alabama,1036
Tuscaloosa,Alabama,544
Anniston,Alabama,253
Gadsden,Alabama,248
Anchorage,Alaska,1389
Fairbanks,Alaska,563
Juneau,Alaska,267


In [4]:
from pyspark.sql.window import Window
spec = Window. \
    partitionBy("state"). \
    orderBy(col("state"),col("female_count").desc())

In [8]:
from pyspark.sql.functions import col,lit,count,dense_rank
df= data. \
withColumn("drank",dense_rank().over(spec)). \
orderBy(col('female_count').desc()). \
filter(col("drank")<=10). \
drop("drank")

In [9]:
df.coalesce(1).write.format('csv'). \
option('header','True'). \
mode('overwrite').save("/user/itv002480/mock_test_02/problem06/solution")

In [10]:
!hdfs dfs -ls /user/`whoami`/mock_test_02/problem06/solution

Found 2 items
-rw-r--r--   3 itv002480 supergroup          0 2022-06-29 05:17 /user/itv002480/mock_test_02/problem06/solution/_SUCCESS
-rw-r--r--   3 itv002480 supergroup       7643 2022-06-29 05:17 /user/itv002480/mock_test_02/problem06/solution/part-00000-5939cbc6-b354-4ccc-809f-3c65d2bb6add-c000.csv


In [11]:
!hdfs dfs -cat /user/`whoami`/mock_test_02/problem06/solution/part*

city,state,female_count
Washington,District of Columbia,16238
Houston,Texas,9998
New York City,New York,8810
El Paso,Texas,8353
Dallas,Texas,6488
Atlanta,Georgia,6468
Sacramento,California,5500
Los Angeles,California,5344
Miami,Florida,5174
Chicago,Illinois,4777
Philadelphia,Pennsylvania,4503
San Antonio,Texas,4493
Phoenix,Arizona,4361
Charlotte,North Carolina,4329
Austin,Texas,4303
Kansas City,Missouri,4255
Oklahoma City,Oklahoma,4199
San Diego,California,4187
Denver,Colorado,4066
Cincinnati,Ohio,4059
Pittsburgh,Pennsylvania,3996
Birmingham,Alabama,3980
Las Vegas,Nevada,3923
San Francisco,California,3729
Minneapolis,Minnesota,3715
Saint Louis,Missouri,3709
Memphis,Tennessee,3689
Richmond,Virginia,3668
Seattle,Washington,3524
Louisville,Kentucky,3492
Salt Lake City,Utah,3462
Des Moines,Iowa,3244
San Jose,California,3221
Orlando,Florida,3187
New Orleans,Louisiana,3186
Fresno,California,3176
Tampa,Florida,3167
Indianapolis,Indiana,3145
Portland,Oregon,3102
Tulsa,Oklahoma,2983
Boston,Mass

In [12]:
!hdfs dfs -cat /user/`whoami`/mock_test_02/problem06/solution/part*|wc -l

320
