# Assignment 5b
## Jonathan Dyer
## jbdyer
Due:  Thursday, October 31, 11:59 pm

## Determining the most popular gender neutral names

The United States Social Security Administration keeps records of all births and provides some of this [data](https://www.ssa.gov/oact/babynames) to the public in a file where each line has the format:

KY,F,1912,Dorothy,209

This is to be interpreted as "In 1912, 209 female babies born in Kentucky were given the first name Dorothy".
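
As a minimal sketch of that interpretation in plain Python (no Spark required), the snippet below splits one line into its five fields. The names `Record` and `parse_line` are hypothetical, chosen only for illustration, and converting `year` and `number` to `int` is an assumption about how the fields might be used later.

```python
from collections import namedtuple

# Hypothetical record type for illustration; field order follows the file format.
Record = namedtuple('Record', ['state', 'gender', 'year', 'fname', 'number'])

def parse_line(line):
    """Split one CSV line into a Record, converting year and number to int."""
    state, gender, year, fname, number = line.strip().split(',')
    return Record(state, gender, int(year), fname, int(number))

rec = parse_line('KY,F,1912,Dorothy,209')
print(rec)
```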

In this exercise you are to write a PySpark program that works with RDDs to determine the most popular gender neutral names. We define a gender neutral name as a baby name that has been given to at least one boy and at least one girl. We define a popular gender neutral name as a name where the ratio of the number of boys with that name to the number of girls with that name is in the range \[0.25..4\], endpoints included.
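
The popularity test can be sketched as a small plain-Python predicate. The function name `is_popular_neutral` and its `low`/`high` parameters are hypothetical; the counts in the two calls are taken from the results shown later in this notebook.

```python
def is_popular_neutral(boys, girls, low=0.25, high=4.0):
    """Return True if the boys-to-girls ratio lies in [low, high]."""
    if boys <= 0 or girls <= 0:
        # A name given to only one gender is not gender neutral at all.
        return False
    return low <= boys / girls <= high

print(is_popular_neutral(412266, 121147))  # Willie: ratio is roughly 3.4
print(is_popular_neutral(676738, 503))     # Jack: ratio is far above 4
```

Both endpoints are included, so a name with exactly four times as many boys as girls (or the reverse) still counts as popular.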

#### Outline of the steps involved:

1. Read all baby names into an RDD of `Names` named tuples
2. Determine a count of all male names and of all female names
3. Determine all gender neutral names by joining male_names and female_names
4. Filter the neutral names so that the ratio lies between 0.25 and 4 (inclusive)
5. Sort the filtered names in descending order by the total number of babies with that name

#### Working with groups in an RDD

Recall that our perspective is that an RDD is a list with superpowers: resilience and distribution are built in. As with regular Python lists, the elements of an RDD can in general be anything. Some RDD operations require a _key_, e.g., `reduceByKey`, `groupByKey`, `sortByKey`. In such instances the RDD has to have the form `[ (key, value), (key, value), ...]`, i.e., each top-level element has to be a tuple of two elements: a key and a value. The value itself can have any structure. Hence, on occasion you may have to rearrange the elements of an RDD. For example, if you have an RDD `r` with \[('Jack', 87), ('Jill', 92)\] and you want to sort by score, you will need to make the score the key:
```
(r.map(lambda x: (x[1], x[0]))  # rearrange the tuples so the score is the key
  .sortByKey())
```
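The same rearrangement can be tried on an ordinary Python list, which may help when reasoning about what the key is at each step. This is only a plain-Python analogy, not Spark code; the third entry in `scores` is made-up extra data.

```python
scores = [('Jack', 87), ('Jill', 92), ('Joan', 75)]  # 'Joan' is made-up extra data

# Swap so the score becomes the key (first element) ...
by_score = [(score, name) for name, score in scores]
# ... sort on that key, which is what a sort by key operates on ...
by_score.sort(key=lambda kv: kv[0])
# ... and swap back if the original (name, score) layout is wanted.
ranked = [(name, score) for score, name in by_score]
print(ranked)
```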


The whole babynames file from the SSA has 6,028,151 records (lines) and information on 311,155,210 babies. To facilitate development, I've sampled 100,000 lines into the file `babynames-100k.csv`. During development, work with the sample. Once done, set `sample` to `False`, run your code, and submit your notebook.

In [1]:
sample = False
if sample:
    file_name = 'babynames-100k.csv'
else:
    file_name = 'babynames2018_state_gender_year_fname_number.csv'

#### Initialize

In [2]:
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext() 

In [3]:
from collections import namedtuple
Names = namedtuple('Names', 
                   ['state', 'gender', 'year', 'fname', 'number'])

### 1. read baby names

In [5]:
lines = sc.textFile(file_name)
baby_names = lines.map(lambda line: Names(*line.split(',')))
baby_names.take(5)

[Names(state='AK', gender='F', year='1910', fname='Mary', number='14'),
 Names(state='AK', gender='F', year='1910', fname='Annie', number='12'),
 Names(state='AK', gender='F', year='1910', fname='Anna', number='10'),
 Names(state='AK', gender='F', year='1910', fname='Margaret', number='8'),
 Names(state='AK', gender='F', year='1910', fname='Helen', number='7')]

### 2. count of male names

In [6]:
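def seq(acc, elem):
    # fold one count (still a string from the CSV) into the running total
    return acc + int(elem)

def comb(a1, a2):
    # merge the running totals computed on two partitions
    return a1 + a2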
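
One way to picture how `aggregateByKey(0, seq, comb)` uses these two functions: within each partition, `seq` folds every value for a key into an accumulator that starts at the zero value, and `comb` then merges the per-partition accumulators for that key. The plain-Python sketch below mimics this with two hand-made partitions. The helper `simulate_aggregate_by_key` and the sample data are hypothetical, and real Spark decides the partitioning itself.

```python
def seq(acc, elem):
    return acc + int(elem)

def comb(a1, a2):
    return a1 + a2

def simulate_aggregate_by_key(partitions, zero, seq_func, comb_func):
    """Mimic aggregateByKey on a list of partitions of (key, value) pairs."""
    merged = {}
    for partition in partitions:
        # Step 1: fold values into a per-key accumulator within one partition.
        local = {}
        for key, value in partition:
            local[key] = seq_func(local.get(key, zero), value)
        # Step 2: merge this partition's accumulators into the overall result.
        for key, acc in local.items():
            merged[key] = comb_func(merged[key], acc) if key in merged else acc
    return merged

# Made-up sample data: counts are strings, as they are after splitting a CSV line.
partitions = [
    [('Mary', '14'), ('Anna', '10'), ('Mary', '6')],
    [('Mary', '5'), ('Helen', '7')],
]
totals = simulate_aggregate_by_key(partitions, 0, seq, comb)
print(totals)
```

Keeping the two functions separate matters here because the values are strings while the accumulator is an integer: `seq` does the conversion, and `comb` only ever sees integers.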

In [7]:
male_names = (
    baby_names.filter(lambda x: x.gender=='M')
            .map(lambda x: (x.fname, x.number))
            .aggregateByKey(0, seq, comb)
            .map(lambda x: (x[1], x[0]))
            )

In [8]:
(male_names.sortByKey(ascending=False)
            .take(5))

[(4997327, 'James'),
 (4869607, 'John'),
 (4734038, 'Robert'),
 (4349307, 'Michael'),
 (3890923, 'William')]

### 3. count of female names

In [9]:
female_names = (
    baby_names.filter(lambda x: x.gender=='F')
            .map(lambda x: (x.fname, x.number))
            .aggregateByKey(0, seq, comb)
            .map(lambda x: (x[1], x[0]))
            )

In [10]:
(female_names.sortByKey(ascending=False)
            .take(5))

[(3741196, 'Mary'),
 (1569022, 'Patricia'),
 (1537684, 'Elizabeth'),
 (1466161, 'Jennifer'),
 (1447943, 'Linda')]

### 3. gender neutral names

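Please note that you join two RDDs on their keys, i.e., the first element of each tuple. The (count, name) pairs built above therefore have to be swapped back to (name, count) before joining.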
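
To make the join semantics concrete, here is a plain-Python sketch of an inner join on keys: only keys present on both sides survive, and each result pairs the left value with the right value. The helper `inner_join` and the counts are made up for illustration; this is not how Spark implements `join`.

```python
def inner_join(left, right):
    """Mimic an inner join on lists of (key, value) pairs."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    # Emit (key, (left_value, right_value)) for every matching pair.
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key.get(key, [])]

# Made-up counts for illustration.
males = [('Jordan', 30), ('James', 50), ('Taylor', 10)]
females = [('Taylor', 32), ('Mary', 40), ('Jordan', 13)]
joined = inner_join(males, females)
print(joined)
```

Names that appear on only one side (`James` and `Mary` here) drop out, which is exactly the property that makes the join a test for gender neutrality.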

In [11]:
m1 = male_names.map(lambda x: (x[1], x[0]))
f1 = female_names.map(lambda x: (x[1], x[0]))
neutral_names = m1.join(f1)
neutral_names.count()

3042

In [12]:
neutral_names.take(5)

[('Jack', (676738, 503)),
 ('Walter', (563180, 814)),
 ('Fred', (297485, 119)),
 ('Raymond', (750015, 773)),
 ('Norman', (243136, 72))]

### 4. top 10 most popular neutral names

In [13]:
top10 = (
    neutral_names.filter(lambda x: (0.25 <= x[1][0] / x[1][1] <= 4))
                 .map(lambda x: (sum(x[1]), x))
                 .sortByKey(ascending=False)
)
top10.take(10)

[(533413, ('Willie', (412266, 121147))),
 (501077, ('Jordan', (371032, 130045))),
 (425506, ('Taylor', (105876, 319630))),
 (367819, ('Leslie', (103893, 263926))),
 (348205, ('Jamie', (82429, 265776))),
 (322741, ('Angel', (229363, 93378))),
 (271308, ('Lee', (215694, 55614))),
 (237358, ('Dana', (48707, 188651))),
 (229814, ('Jessie', (99340, 130474))),
 (226814, ('Marion', (63382, 163432)))]