## Post Lab 2 - Analyse US baby names

Task 1 - Load the dataset into the root directory of HDFS. The two commands below first copy the file into the namenode container and then put it into the HDFS cluster.

Before running them, make sure the parquet file sits directly under your current directory: run `ls` in the terminal and check that the file is listed.
- `docker cp baby_names_unclean.parquet namenode-master:/`
- `docker exec -it namenode-master hdfs dfs -put -f /baby_names_unclean.parquet hdfs://namenode-master:8020/`

You can then verify that the file exists in HDFS by checking the HDFS UI.

## Load the dataset - US Baby Names 1880-2017


Description: US baby names provided by the SSA.

This dataset contains all names used
for at least 5 children of either sex during a year.


The file has `1924665` rows and 4 columns.

```
|-- name: string (nullable = true)
|-- n: integer (nullable = true)
|-- sex: string (nullable = true)
|-- year: integer (nullable = true)
```

Each row gives, for a given name, sex, and year, the number of babies
of that sex who were given that name during that year. Names
with fewer than 5 occurrences during the year were not recorded.

Ensure that the dataframe has the following schema:

    root
        |-- name: string (nullable = true)
        |-- n: integer (nullable = true)
        |-- sex: string (nullable = true)
        |-- year: integer (nullable = true)

## Tasks

1. What are the 10 most popular names for females in the year 2000?
2. What are the 10 most popular names for males in the year 2000?

3. Which year had

- a) the most distinct female names

- b) the most distinct male names

- c) the most distinct names (both male and female)

4. In the year 2010, how many names were assigned to both males and females?

5. Create a new column that shows the length of each name.

6. Create a new column that shows the total number of times the name has been given to a baby across all years.

7. Partition your dataframe based on the year the baby was born and write the dataframe to hdfs.

In [3]:
from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("US_Baby_Names").getOrCreate()

# Parquet files store their own schema, so the CSV-oriented options
# header and inferSchema have no effect here and are omitted.
df = spark.read.parquet("hdfs://namenode-master/baby_names_unclean.parquet")

df.printSchema()
df.show()

                                                                                

root
 |-- name: string (nullable = true)
 |-- n: double (nullable = true)
 |-- sex: string (nullable = true)
 |-- year: double (nullable = true)




+----------+-----+---+------+
|      name|    n|sex|  year|
+----------+-----+---+------+
|    Emilia|112.0|  F|1985.0|
|     Kelsi|112.0|  F|1985.0|
|    Margot|112.0|  F|1985.0|
|    Mariam|112.0|  F|1985.0|
|  Scarlett|112.0|  F|1985.0|
|      Aida|111.0|  F|1985.0|
|    Ashlei|111.0|  F|1985.0|
|     Greta|111.0|  F|1985.0|
|    Jaimee|111.0|  F|1985.0|
|     Lorna|111.0|  F|1985.0|
|   Rosario|111.0|  F|1985.0|
|     Sandi|111.0|  F|1985.0|
|   Sharina|111.0|  F|1985.0|
|    Tashia|111.0|  F|1985.0|
|     Adina|110.0|  F|1985.0|
|    Ahsley|110.0|  F|1985.0|
|Alessandra|110.0|  F|1985.0|
|    Amalia|110.0|  F|1985.0|
|    Chelsi|110.0|  F|1985.0|
|    Darcie|110.0|  F|1985.0|
+----------+-----+---+------+
only showing top 20 rows



                                                                                

In [4]:
#1. What are the 10 most popular names for Females in year 2000.

from pyspark.sql import functions as fn

df.filter((df.sex == 'F') & (df.year == 2000))\
   .groupBy(df.name)\
   .agg(fn.sum(df.n).alias('total_count'))\
   .orderBy(fn.desc('total_count'))\
   .limit(10)\
   .select('name', 'total_count')\
   .show()


+---------+-----------+
|     name|total_count|
+---------+-----------+
|    Emily|    25953.0|
|   Hannah|    23080.0|
|  Madison|    19967.0|
|   Ashley|    17997.0|
|    Sarah|    17697.0|
|   Alexis|    17629.0|
| Samantha|    17266.0|
|  Jessica|    15709.0|
|Elizabeth|    15094.0|
|   Taylor|    15078.0|
+---------+-----------+



                                                                                

In [5]:
#2. What are the 10 most popular names for Males in year 2000.
df.filter((df.sex == 'M') & (df.year == 2000))\
   .groupBy(df.name)\
   .agg(fn.sum(df.n).alias('total_count'))\
   .orderBy(fn.desc('total_count'))\
   .limit(10)\
   .select('name', 'total_count')\
   .show()



+-----------+-----------+
|       name|total_count|
+-----------+-----------+
|      Jacob|    34471.0|
|    Michael|    32035.0|
|    Matthew|    28572.0|
|     Joshua|    27538.0|
|Christopher|    24931.0|
|   Nicholas|    24652.0|
|     Andrew|    23639.0|
|     Joseph|    22825.0|
|     Daniel|    22312.0|
|      Tyler|    21503.0|
+-----------+-----------+



                                                                                

In [6]:
#3. Which year had

#a) the most distinct female names
df.filter(df.sex == 'F')\
  .groupBy(df.year)\
  .agg(fn.countDistinct(df.name).alias('count'))\
  .orderBy(fn.desc('count'))\
  .select(df.year)\
  .show(1)
    



+------+
|  year|
+------+
|2007.0|
+------+
only showing top 1 row



                                                                                

In [7]:
#b) the most distinct male names
df.filter(df.sex == 'M')\
  .groupBy(df.year)\
  .agg(fn.countDistinct(df.name).alias('count'))\
  .orderBy(fn.desc('count'))\
  .select(df.year)\
  .show(1)



+------+
|  year|
+------+
|2008.0|
+------+
only showing top 1 row



                                                                                

In [8]:
#c) the most distinct names (both male and female)
df.groupBy(df.year)\
  .agg(fn.countDistinct(df.name).alias('count'))\
  .orderBy(fn.desc('count'))\
  .select(df.year)\
  .show(1)



+------+
|  year|
+------+
|2008.0|
+------+
only showing top 1 row



                                                                                

In [9]:
#4. In the year 2010, how many names were assigned to both males and females.
# A name qualifies when it appears with two distinct values of sex in 2010.
df.filter(df.year == 2010)\
  .groupBy(df.name)\
  .agg(fn.countDistinct(df.sex).alias('distinct_sexes'))\
  .where(fn.col('distinct_sexes') == 2)\
  .count()

                                                                                

2443

In [10]:
#5. Create a new column that shows the length of each name.
from pyspark.sql.functions import length
df= df.withColumn('length',length(df.name))
df.show()
    


+----------+-----+---+------+------+
|      name|    n|sex|  year|length|
+----------+-----+---+------+------+
|    Emilia|112.0|  F|1985.0|     6|
|     Kelsi|112.0|  F|1985.0|     5|
|    Margot|112.0|  F|1985.0|     6|
|    Mariam|112.0|  F|1985.0|     6|
|  Scarlett|112.0|  F|1985.0|     8|
|      Aida|111.0|  F|1985.0|     4|
|    Ashlei|111.0|  F|1985.0|     6|
|     Greta|111.0|  F|1985.0|     5|
|    Jaimee|111.0|  F|1985.0|     6|
|     Lorna|111.0|  F|1985.0|     5|
|   Rosario|111.0|  F|1985.0|     7|
|     Sandi|111.0|  F|1985.0|     5|
|   Sharina|111.0|  F|1985.0|     7|
|    Tashia|111.0|  F|1985.0|     6|
|     Adina|110.0|  F|1985.0|     5|
|    Ahsley|110.0|  F|1985.0|     6|
|Alessandra|110.0|  F|1985.0|    10|
|    Amalia|110.0|  F|1985.0|     6|
|    Chelsi|110.0|  F|1985.0|     6|
|    Darcie|110.0|  F|1985.0|     6|
+----------+-----+---+------+------+
only showing top 20 rows



In [43]:
#6. Create a new column that shows the total number of times the name has been given to a baby across all years.
from pyspark.sql.functions import sum as _sum
from pyspark.sql.window import Window

# Sum n over every row that shares the same name (all years, both sexes)
df = df.withColumn(
    'total_number_of_times',
    _sum(df.n).over(Window.partitionBy(df.name))
)
df.show()

+------+----+---+------+------+---------------------+
|  name|   n|sex|  year|length|total_number_of_times|
+------+----+---+------+------+---------------------+
|  Aada| 5.0|  F|2015.0|     4|                  5.0|
| Aadit|13.0|  M|2003.0|     5|                359.0|
| Aadit|22.0|  M|2004.0|     5|                359.0|
| Aadit|15.0|  M|2005.0|     5|                359.0|
| Aadit|17.0|  M|2006.0|     5|                359.0|
| Aadit|31.0|  M|2007.0|     5|                359.0|
| Aadit|24.0|  M|2008.0|     5|                359.0|
| Aadit|12.0|  M|2009.0|     5|                359.0|
| Aadit|23.0|  M|2010.0|     5|                359.0|
| Aadit|24.0|  M|2011.0|     5|                359.0|
| Aadit|22.0|  M|2012.0|     5|                359.0|
| Aadit|33.0|  M|2013.0|     5|                359.0|
| Aadit|31.0|  M|2014.0|     5|                359.0|
| Aadit|23.0|  M|2015.0|     5|                359.0|
| Aadit|23.0|  M|2016.0|     5|                359.0|
| Aadit|46.0|  M|2017.0|    

In [12]:
#7. Partition your dataframe based on the year the baby was born and write the dataframe to hdfs.
df.write.partitionBy("year").parquet("hdfs://namenode-master/baby_names_clean.parquet")

                                                                                