# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 to 2014.


### Step 1. Import the necessary libraries

In [81]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql import Window

spark = SparkSession.builder.appName("Baby_names").getOrCreate()

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

### Step 3. Assign it to a variable called baby_names.

In [82]:
schema = T.StructType([
    T.StructField("Extra", T.IntegerType(), True),
    T.StructField("id", T.IntegerType(), True),
    T.StructField("name", T.StringType(), True),
    T.StructField("year", T.IntegerType(), True),
    T.StructField("gender", T.StringType(), True),
    T.StructField("state", T.StringType(), True),
    T.StructField("count", T.IntegerType(), True),
])

In [83]:
baby_names = spark.read.csv("US_Baby_Names_right.csv", header=True, schema=schema)

### Step 4. See the first 10 entries

In [84]:
baby_names.head(10)

[Row(Extra=11349, id=11350, name='Emma', year=2004, gender='F', state='AK', count=62),
 Row(Extra=11350, id=11351, name='Madison', year=2004, gender='F', state='AK', count=48),
 Row(Extra=11351, id=11352, name='Hannah', year=2004, gender='F', state='AK', count=46),
 Row(Extra=11352, id=11353, name='Grace', year=2004, gender='F', state='AK', count=44),
 Row(Extra=11353, id=11354, name='Emily', year=2004, gender='F', state='AK', count=41),
 Row(Extra=11354, id=11355, name='Abigail', year=2004, gender='F', state='AK', count=37),
 Row(Extra=11355, id=11356, name='Olivia', year=2004, gender='F', state='AK', count=33),
 Row(Extra=11356, id=11357, name='Isabella', year=2004, gender='F', state='AK', count=30),
 Row(Extra=11357, id=11358, name='Alyssa', year=2004, gender='F', state='AK', count=29),
 Row(Extra=11358, id=11359, name='Sophia', year=2004, gender='F', state='AK', count=28)]

### Step 5. Delete the columns 'Unnamed: 0' and 'Id' (read here as 'Extra' and 'id' by the schema in Step 3)

In [85]:
baby_names = baby_names.drop('Extra', 'id')

In [86]:
baby_names.head(5)

[Row(name='Emma', year=2004, gender='F', state='AK', count=62),
 Row(name='Madison', year=2004, gender='F', state='AK', count=48),
 Row(name='Hannah', year=2004, gender='F', state='AK', count=46),
 Row(name='Grace', year=2004, gender='F', state='AK', count=44),
 Row(name='Emily', year=2004, gender='F', state='AK', count=41)]

### Step 6. Are there more male or female names in the dataset?

In [87]:
baby_names.select('gender').groupBy('gender').count().show()

+------+------+
|gender| count|
+------+------+
|     F|558846|
|     M|457549|
+------+------+



### Step 7. Group the dataset by name and assign the result to a variable called name

In [88]:
name = baby_names.groupBy('name').agg(F.sum('count').alias('count'))

In [89]:
name.head(5)

[Row(name='Kiana', count=5965),
 Row(name='Alayna', count=14171),
 Row(name='Ember', count=3181),
 Row(name='Tyler', count=129989),
 Row(name='Maddox', count=20716)]

### Step 8. How many different names exist in the dataset?

In [90]:
name.count()

17632

### Step 9. What is the name with most occurrences?

In [91]:
name.orderBy('count', ascending=False).show(1)

+-----+------+
| name| count|
+-----+------+
|Jacob|242874|
+-----+------+
only showing top 1 row



### Step 10. How many different names have the least occurrences?

In [92]:
# rank() assigns 1 to every name tied for the smallest total
name.withColumn("rank", F.rank().over(Window.orderBy("count"))).\
    where(F.col('rank') == 1).count()


2578

### Step 11. What is the median name occurrence?

In [108]:
# Number of names whose total equals the (approximate) median total
median_count = name.approxQuantile('count', [0.5], 0.00001)[0]
name.where(F.col('count') == median_count).count()

66

### Step 12. What is the standard deviation of names?

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [96]:
name.select('count').summary().show()

+-------+------------------+
|summary|             count|
+-------+------------------+
|  count|             17632|
|   mean| 2008.932168784029|
| stddev|11006.069467890562|
|    min|                 5|
|    25%|                11|
|    50%|                49|
|    75%|               337|
|    max|            242874|
+-------+------------------+

