# US - Baby Names

### Introduction:

We are going to use a subset of [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names) from Kaggle.  
The file contains names from 2004 through 2014.


### Step 1. Import the necessary libraries

In [2]:
from pyspark.sql import SparkSession
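# Explicit imports; note that min and sum shadow the Python built-ins of the same name
from pyspark.sql.functions import col, min, stddev, sum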
import requests

spark = SparkSession.builder.appName("PySparkPrac").getOrCreate()

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv). 

In [3]:
url = "https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv"
file_path = "/home/neosoft/Documents/Practice/Daily_Practice/21st_Aug/US_Baby_Names_right.csv"

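response = requests.get(url, timeout=30)
response.raise_for_status()  # fail early instead of saving an HTTP error page as the CSV

with open(file_path, "wb") as f:
    f.write(response.content)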

### Step 3. Assign it to a variable called baby_names.

In [9]:
baby_names = spark.read.option("header",True).option("inferSchema",True).csv(file_path)
# Drop the first column, the unnamed row index (the 'Unnamed: 0' column mentioned in Step 5)
baby_names = baby_names.drop(baby_names.columns[0])
baby_names.show(5)

+-----+-------+----+------+-----+-----+
|   Id|   Name|Year|Gender|State|Count|
+-----+-------+----+------+-----+-----+
|11350|   Emma|2004|     F|   AK|   62|
|11351|Madison|2004|     F|   AK|   48|
|11352| Hannah|2004|     F|   AK|   46|
|11353|  Grace|2004|     F|   AK|   44|
|11354|  Emily|2004|     F|   AK|   41|
+-----+-------+----+------+-----+-----+
only showing top 5 rows



### Step 4. See the first 10 entries

In [19]:
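baby_names.show(10)   # prints the first 10 rows as a formatted table
baby_names.head(10)   # returns the first 10 rows as a list of Row objects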

+-----+--------+----+------+-----+-----+
|   Id|    Name|Year|Gender|State|Count|
+-----+--------+----+------+-----+-----+
|11350|    Emma|2004|     F|   AK|   62|
|11351| Madison|2004|     F|   AK|   48|
|11352|  Hannah|2004|     F|   AK|   46|
|11353|   Grace|2004|     F|   AK|   44|
|11354|   Emily|2004|     F|   AK|   41|
|11355| Abigail|2004|     F|   AK|   37|
|11356|  Olivia|2004|     F|   AK|   33|
|11357|Isabella|2004|     F|   AK|   30|
|11358|  Alyssa|2004|     F|   AK|   29|
|11359|  Sophia|2004|     F|   AK|   28|
+-----+--------+----+------+-----+-----+
only showing top 10 rows



[Row(Id=11350, Name='Emma', Year=2004, Gender='F', State='AK', Count=62),
 Row(Id=11351, Name='Madison', Year=2004, Gender='F', State='AK', Count=48),
 Row(Id=11352, Name='Hannah', Year=2004, Gender='F', State='AK', Count=46),
 Row(Id=11353, Name='Grace', Year=2004, Gender='F', State='AK', Count=44),
 Row(Id=11354, Name='Emily', Year=2004, Gender='F', State='AK', Count=41),
 Row(Id=11355, Name='Abigail', Year=2004, Gender='F', State='AK', Count=37),
 Row(Id=11356, Name='Olivia', Year=2004, Gender='F', State='AK', Count=33),
 Row(Id=11357, Name='Isabella', Year=2004, Gender='F', State='AK', Count=30),
 Row(Id=11358, Name='Alyssa', Year=2004, Gender='F', State='AK', Count=29),
 Row(Id=11359, Name='Sophia', Year=2004, Gender='F', State='AK', Count=28)]

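### Step 5. Delete the columns 'Unnamed: 0' and 'Id'

In [18]:
# 'Unnamed: 0' was already dropped in Step 3.
# drop() returns a new DataFrame, so reassign it to keep the change.
baby_names = baby_names.drop("Id")
baby_names.show()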

+---------+----+------+-----+-----+
|     Name|Year|Gender|State|Count|
+---------+----+------+-----+-----+
|     Emma|2004|     F|   AK|   62|
|  Madison|2004|     F|   AK|   48|
|   Hannah|2004|     F|   AK|   46|
|    Grace|2004|     F|   AK|   44|
|    Emily|2004|     F|   AK|   41|
|  Abigail|2004|     F|   AK|   37|
|   Olivia|2004|     F|   AK|   33|
| Isabella|2004|     F|   AK|   30|
|   Alyssa|2004|     F|   AK|   29|
|   Sophia|2004|     F|   AK|   28|
|   Alexis|2004|     F|   AK|   27|
|Elizabeth|2004|     F|   AK|   27|
|   Hailey|2004|     F|   AK|   27|
|     Anna|2004|     F|   AK|   26|
|  Natalie|2004|     F|   AK|   25|
|    Sarah|2004|     F|   AK|   25|
|   Sydney|2004|     F|   AK|   25|
|      Ava|2004|     F|   AK|   23|
|  Trinity|2004|     F|   AK|   22|
|    Haley|2004|     F|   AK|   21|
+---------+----+------+-----+-----+
only showing top 20 rows



### Step 6. Are there more male or female names in the dataset?

In [23]:
# count() counts rows (one per name/year/state/gender record), not babies
count_names = baby_names.groupBy("Gender").count()
count_names.orderBy(col("count").desc()).first()

Row(Gender='F', count=558846)
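
The result above counts rows, so it says that more records are female, not necessarily that more girls were born. A minimal plain-Python sketch of the distinction, using a handful of made-up rows (the tuples below are hypothetical, not taken from the dataset):

```python
from collections import Counter

# Hypothetical (Name, Gender, Count) records; the values are invented for illustration.
rows = [
    ("Emma", "F", 62),
    ("Madison", "F", 48),
    ("Hannah", "F", 46),
    ("Jacob", "M", 120),
    ("Ethan", "M", 90),
]

records_per_gender = Counter()  # what groupBy("Gender").count() measures
babies_per_gender = Counter()   # what summing the Count column would measure

for name, gender, count in rows:
    records_per_gender[gender] += 1
    babies_per_gender[gender] += count

print(records_per_gender.most_common(1))  # gender with the most records
print(babies_per_gender.most_common(1))   # gender with the most babies
```

In PySpark, the second tally would correspond to aggregating with `sum("Count")` rather than `count()`.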

### Step 7. Group the dataset by name and assign to names

In [26]:
names = baby_names.groupBy("Name").agg(sum("Count").alias("Total_count"))
names.show(5)



+------+-----------+
|  Name|Total_count|
+------+-----------+
| Kiana|       5965|
|Alayna|      14171|
| Ember|       3181|
| Tyler|     129989|
|Maddox|      20716|
+------+-----------+
only showing top 5 rows

### Step 8. How many different names exist in the dataset?

In [28]:
# names already has one row per distinct Name, so distinct() is unnecessary
names.count()

17632

### Step 9. What is the name with the most occurrences?

In [29]:
names.orderBy(col("Total_count").desc()).first()

Row(Name='Jacob', Total_count=242874)

### Step 10. How many different names have the least occurrences?

In [32]:
min_count = names.agg(min("Total_count")).first()[0]
least_common_names_count = names.filter(names.Total_count == min_count).count()
least_common_names_count

2578

### Step 11. What is the median name occurrence?

In [37]:
# approxQuantile(col, probabilities, relativeError): with 0.01 the result is an approximate median
median = names.approxQuantile("Total_count", [0.5], 0.01)[0]
median

48.0
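
The `0.01` passed to `approxQuantile` is a relative error, so `48.0` is an approximate median. Spark's documentation states that for N rows, probability p and error err, the returned value has an exact rank between floor((p - err) * N) and ceil((p + err) * N). A small sketch of that bound in plain Python (the helper name `rank_window` is hypothetical), applied to the 17632 names in the dataset:

```python
import math

def rank_window(n_rows, probability, relative_error):
    """Return the (lowest, highest) exact rank that approxQuantile may return."""
    low = math.floor((probability - relative_error) * n_rows)
    high = math.ceil((probability + relative_error) * n_rows)
    # Clamp to valid ranks in case the error pushes past either end.
    return max(low, 0), min(high, n_rows)

low, high = rank_window(17632, 0.5, 0.01)
print(low, high)
```

Passing `0.0` as the relative error asks Spark for the exact quantile, which can be considerably more expensive on large data.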

### Step 12. What is the standard deviation of name occurrences?

In [34]:
std_dev = names.select(stddev("Total_count")).first()[0]
std_dev

11006.069467890562
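
`stddev` in `pyspark.sql.functions` is an alias for `stddev_samp`, the sample standard deviation, which divides by n - 1 rather than n. A quick plain-Python sketch of the difference, using the five `Count` values printed in Step 3 purely as toy data:

```python
import math
import statistics

# Count values of the five rows printed in Step 3, used here only as toy data.
counts = [62, 48, 46, 44, 41]

n = len(counts)
mean = math.fsum(counts) / n
squared_deviations = math.fsum((c - mean) ** 2 for c in counts)

sample_std = math.sqrt(squared_deviations / (n - 1))  # same formula as stddev / stddev_samp
population_std = math.sqrt(squared_deviations / n)    # same formula as stddev_pop

# Cross-check the manual formulas against the standard library.
print(sample_std, statistics.stdev(counts))
print(population_std, statistics.pstdev(counts))
```

With 17632 names the two forms differ only marginally; `stddev_pop` is available in `pyspark.sql.functions` if the population form is wanted.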

### Step 13. Get a summary with the mean, min, max, std and quartiles.

In [36]:
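# describe() reports count, mean, stddev, min and max, but no quartiles
summary = names.select("Total_count").describe()
summary.show()

# Quartiles (25%, 50%, 75%), approximated with a relative error of 0.01
quartiles = names.approxQuantile("Total_count", [0.25, 0.5, 0.75], 0.01)
quartiles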


+-------+------------------+
|summary|       Total_count|
+-------+------------------+
|  count|             17632|
|   mean| 2008.932168784029|
| stddev|11006.069467890562|
|    min|                 5|
|    max|            242874|
+-------+------------------+



[10.0, 48.0, 316.0]
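
Taken together, the summary and the (approximate) quartiles suggest a strongly right-skewed distribution: the mean sits far above the median. A short sketch of that arithmetic, using only the figures printed above:

```python
# Figures copied from the outputs above; the quartiles are approximate.
q1, q2, q3 = 10.0, 48.0, 316.0
mean = 2008.932168784029

iqr = q3 - q1               # interquartile range
mean_to_median = mean / q2  # values well above 1 hint at a long right tail

print(f"IQR: {iqr}")
print(f"mean / median: {mean_to_median:.1f}")
```

A few very common names, such as the most frequent one at 242874 occurrences, pull the mean upward, while most names occur only a handful of times.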